COMPARISONS AMONG TREATMENT MEANS 
IN AN ANALYSIS OF VARIANCE 


FOREWORD 


That the analysis of variance is a powerful technique for testing hypotheses has been accepted for many years. 
In analyzing a set of data, however, the scientist usually is interested in relationships between the means to which the 
analysis of variance is insensitive. 

As early as 1939, statisticians used techniques independent of the analysis of variance to compare means from a 
given experiment. Since the middle 1950’s, the interest and literature have increased almost exponentially. 

In May 1957, Biometrical Services issued ARS 20-3, Mean Separation by the Functional Analysis of Variance 
and Multiple Comparisons. This publication has been out of print for many years. Since the publication of 
ARS 20-3, much work has been done on the subject, indicating the need for a major revision. 

Since the job of coordinating the national aspects of statistical consulting in ARS was delegated to the Data 
Systems Application Division (DSAD), we asked Victor Chew, mathematical statistician, to revise ARS 20-3. We 
feel that he has done a very thorough job, which should put mean separation techniques in the appropriate field of 
reference with respect to other statistical techniques that may be used in drawing judgements from data. 

Copies of this publication may be obtained from Victor Chew, University of Florida, Room 217, Rolfs Hall, 
Gainesville, Florida 32611. 

Judson U. McGuire, Jr. 
Staff Specialist, DSAD—ARS 
PREFACE 


The equality of the true average responses of two treatments (varieties, insecticides, concentrations, 
temperatures, etc.) usually is tested statistically by the Student’s t-test. This is generalized for t (three or 
more) treatments by the F-test or the analysis of variance. If the F-test rejects the hypothesis that the t 
treatment means are equal, the only conclusion is that the t means are not all equal. It does not necessarily 
follow that these t means are all unequal although this may well be true. The next stage in the data analysis is 
to determine which treatment means are different. Repeated application of the Student’s t-test to all possible 
pairs of treatment means (using pooled error either from all t samples or only from the two samples involved 
in the t-test) usually is discouraged since this procedure gives a large probability of getting one or more false 
positives (that is, of declaring two treatment means to be different, when they are, in fact, equal). Special 
techniques (called multiple comparison procedures) are available for this purpose. 

Uses and abuses of multiple comparison procedures are discussed in this publication. One glaring abuse 
is their use in comparing several levels of a quantitative factor (such as concentration, temperature, or pH). 
Regression analysis is the appropriate technique here. Equivalently, the treatment sum of squares in the 
analysis of variance table should be partitioned into linear, quadratic, etc., components. In comparing the 
effects of, say, 10, 20, 30, and 40 p/m of a certain chemical, if the regression of the response on concentration or 
if any component of the sum of squares for concentrations is significant, then no multiple comparison 
procedure is necessary. ALL concentrations are significantly different in their effects. In fact, not only will 10 
and 20 p/m be different, but so also will 10 and 10.1 p/m. The difference, of course, between the effects of 10 
p/m and 10.1 p/m will be extremely small. However, the usual statistical test of significance is not concerned 
with the magnitude of the difference, but only whether a true difference exists, no matter how small. 
COMPARISONS AMONG TREATMENT MEANS 
IN AN ANALYSIS OF VARIANCE 


By Victor Chew¹ 


CHAPTER 1. INTRODUCTION 


Before embarking on an experimental project, the research scientist should carefully consider various 
issues. These issues include questions that the experiment hopefully will answer, the factors or variables to 
be controlled or kept constant during the experiment, the levels of the factors to be varied in the study, the 
number of observations to be taken, and the manner in which these observations will be grouped into blocks. 
We shall need fewer observations or have wider applicability of the results, or both, if the experiment is 
designed efficiently. 

This publication is concerned with a particular facet of the analysis of the experimental data, assuming 
that the experiment has been designed properly. It is applicable irrespective of the experimental design 
(completely randomized, randomized blocks, Latin square, split plot, etc.). We also shall assume that the 
reader is familiar with the computational aspects of the analysis of variance for these designs. 

The basic terms and notions in statistical inference will be reviewed in this chapter. This is necessary to 
understand the relative merits of multiple comparison procedures that are currently available. 

In the simplest hypothesis testing situation, we compare two treatments (varieties of peanuts, fertilizers, 
temperatures, pH, machine settings, etc.). If we denote the true means of the two treatments by μ₁ and μ₂, the 
statistical hypothesis to be tested is usually that these two means are equal (μ₁ = μ₂). This hypothesis, called 
the null hypothesis, often is denoted by H₀. We write it as H₀: (μ₁ − μ₂) = 0. (We can test a more general 
hypothesis, viz., (μ₁ − μ₂) = d, where d is specified numerically.) 

In classical hypothesis testing, we must decide whether to accept or to reject H₀. (In sequential testing, 
we allow a third alternative of requiring more observations to be taken.) Because the true or population 
means μ₁ and μ₂ are unknown and unknowable, our decision from the statistical test (whether to accept or 
reject H₀) is subject to error. If ȳ₁ and ȳ₂ are the observed or sample means, estimating μ₁ and μ₂, 
respectively, then because of nonhomogeneity of the experimental material (such as plants, animals, plots of 
land, batches of peanuts), failure to reproduce identical experimental conditions, errors of measurements, 
etc., ȳ₁ and ȳ₂ will be unequal, even if μ₁ and μ₂ are equal. In fact, we may even have ȳ₁ larger than ȳ₂ when 
actually μ₁ is smaller than μ₂, especially from a small experiment. 


There are two kinds of error in hypothesis testing: 


Type I: Reject H₀ when H₀ is, in fact, true (i.e., erroneously deciding that μ₁ and μ₂ are unequal). 
Type II: Accept H₀ when H₀ is, in fact, false (i.e., incorrectly deciding that μ₁ and μ₂ are equal). 


The probabilities of a test making these errors usually are denoted by α and β, respectively. The perfect test 
is, of course, infallible (where α = β = 0), but this is impossible with a finite sample. A good experiment is one 
in which both α and β are small. The value of α is called the significance level of the test, sometimes expressed 
as a percentage. By suitably choosing the rejection region or critical values for the test statistic, we can make 
α as small as we like, but only at the expense of increasing β. For example, we can make α = 0 by always 
accepting H₀, regardless of the experimental data, but in this case β = 1. The only way to decrease both α and 
β simultaneously is to increase the sample size (number of observations). Conventionally, α is taken to be 
equal to .05 or .01. With β defined as the probability of accepting H₀ when H₀ is false, (1 − β) is the probability 
of rejecting H₀ when H₀ is false. This quantity is called the power of the test: the probability that the test 
detects a difference when one exists. There are infinitely many tests with the same value of α; among these, we 
choose the most powerful one (for which β is least) if one exists. 


¹ Mathematical statistician, Biometrical and Statistical Services, Agricultural Research Service, U.S. Department of Agriculture, 
217 Rolfs Hall, University of Florida, Gainesville, Fla. 32611 

If H₀ is false, an alternative hypothesis (denoted by H₁) is true. Corresponding to H₀: (μ₁ − μ₂) = 0, 
three possible alternative hypotheses are (μ₁ − μ₂) > 0, (μ₁ − μ₂) < 0, and (μ₁ − μ₂) ≠ 0, called the right-tail, 
left-tail, and two-tail alternative hypotheses, respectively. If the first treatment is “control” (i.e., no 
treatment at all), the second treatment is the application of some insecticide, and the response being 
measured is the number of a particular insect per plant, we know a priori that the alternative to H₀: μ₁ = μ₂ is 
H₁: μ₁ > μ₂ because the application of the insecticide cannot possibly increase the average count. By 
capitalizing on the one-sidedness of H₁, we can construct a more powerful test of H₀, with the same α. If we 
are comparing two new insecticides, the alternative hypothesis is two-sided. 

It will be seen that α is associated with H₀ and β with H₁. This explains why we can control α but not β. 
We need the actual difference between the two means to control β. For this reason, experimenters too often 
ignore Type II errors. If they are only concerned with holding Type I errors down to 5%, they need not 
conduct the experiment at all. They merely need to take 20 index cards, mark one with an X, shuffle them 
thoroughly, and draw one card at random. Reject H₀ if the marked card is drawn. At a saving of hundreds if 
not thousands of dollars, this experimenter has only a 5% chance of making a Type I error. The reader should 
think about the value of β in this case. 

We cannot emphasize strongly enough the distinction between statistical and practical significance. Any 
difference between the sample means ȳ₁ and ȳ₂, no matter how small, must be declared statistically 
significant if the population or true means μ₁ and μ₂ are unequal, unless the test has committed a Type II 
error (incorrectly declaring two means equal). The test will declare the difference significant if we have 
enough replications. In calculating the number “n” of observations to be taken, we only should require n to be 
large enough so that the test will detect a difference of at least d (of practical significance) between μ₁ and μ₂. 
It is no big loss to declare incorrectly that μ₁ and μ₂ are equal if they differ by an insignificant amount. 

The author thinks that the research worker has been oversold on hypothesis testing. Just as no two peas 
in a pod are identical, no two treatment means will be exactly equal. They always will be different, even if only 
in the thousandth decimal place. It seems ridiculous, therefore, to test a hypothesis that we a priori know is 
almost certain to be false. If the test accepts the hypothesis of equal treatments, a Type II error probably has 
occurred. A related but much more informative alternative approach is interval estimation of (μ₁ − μ₂). The 
confidence limits, of the form (ȳ₁ − ȳ₂) ± c, will tell us whether the null hypothesis will be accepted (if the 
limits have different signs) or rejected (if they have the same signs). They also will give the estimated 
magnitude of the actual difference. The value of c depends, among other things, on the confidence level γ. If γ 
= 0.95, we have 95% confidence that (μ₁ − μ₂) is between (ȳ₁ − ȳ₂ − c) and (ȳ₁ − ȳ₂ + c). The closer γ is to 
unity, the wider the confidence interval. For a given γ, we can shorten the interval by increasing the sample 
size. 
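
The interval estimate just described can be sketched numerically. This is only an illustration under stated assumptions: the sample means, the standard error, and the function name are hypothetical, and a large-sample normal quantile stands in for the Student's t quantile that a small experiment would call for.

```python
from statistics import NormalDist


def diff_interval(ybar1, ybar2, std_err, gamma=0.95):
    """Approximate confidence limits (ybar1 - ybar2) -/+ c for (mu1 - mu2).

    Assumption: c is the two-sided standard normal quantile times the standard
    error of the difference; with few error d.f., a Student's t quantile
    should replace the normal quantile.
    """
    c = NormalDist().inv_cdf(0.5 + gamma / 2.0) * std_err
    diff = ybar1 - ybar2
    return diff - c, diff + c


# Hypothetical sample means and standard error of their difference.
lo95, hi95 = diff_interval(52.0, 47.5, std_err=1.8, gamma=0.95)
lo99, hi99 = diff_interval(52.0, 47.5, std_err=1.8, gamma=0.99)

# Limits with the same sign: the hypothesis mu1 = mu2 is rejected.
reject_95 = (lo95 > 0) == (hi95 > 0)
reject_99 = (lo99 > 0) == (hi99 > 0)
print(round(lo95, 3), round(hi95, 3), reject_95)
print(round(lo99, 3), round(hi99, 3), reject_99)
```

With these made-up numbers the 95% limits are both positive while the wider 99% limits straddle zero, so the same data reject the null hypothesis at one level and not at the other; in both cases the interval also shows the estimated size of the difference.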

The practice of hypothesis testing when comparing several treatments is even more difficult to justify. 
When comparing 10 new varieties of corn, for example, it is inconceivable that all the true average yields will 
be exactly equal. Besides a simultaneous confidence interval approach for all pairs of varieties, a better 
objective may be to select the smallest subgroup that has a preassigned probability (95%, say) of including the 
highest yielding variety. This subgroup of varieties may be tested more intensively and compared in a later 
experiment, as in the screening of new drugs. 


CHAPTER 2. PARTITIONING OF DEGREES OF FREEDOM FOR TREATMENTS 


This chapter deals with situations in which it is possible, before performing the experiment, to partition 
the degrees of freedom (d.f.) for treatments, either completely into single d.f. or partially into groups of d.f. 
Partitioning must not be suggested after examination of the experimental data. LeClerg (1957)² referred to 
this partitioning as “functional analysis of variance.” Use of a multiple comparison procedure in this chapter 
(with a couple of exceptions, explicitly stated) constitutes an abuse of the technique. If the difference between 
the observed average responses of two treatments is statistically significant, we shall simply say that the two 
treatments are different. 


² The year in parentheses following the author’s name refers to List of References, p. 59. 

In this chapter, a significant F-test for treatments is not a prerequisite for the partitioning of the 
treatments d.f. or s.s. (sum of squares). In fact, the F-test need not and should not be carried out at all. In 
comparing t treatments, with (t — 1)d.f., the blanket or overall F-test for treatments is averaged over (t — 1) 
orthogonal comparisons (defined later). If only one or two of these comparisons (or contrasts) are significant, 
the overall F-test is diluted or weakened by the (t — 2) or (t — 3) nonsignificant contrasts and erroneously may 
give a nonsignificant F value. 


2.1 Orthogonal Contrasts 


Let ȳ₁, ȳ₂, . . ., ȳₜ and T₁, T₂, . . ., Tₜ be the sample means and totals from Treatments 1, 2, . . ., t, 
respectively. (Unless otherwise stated, we shall assume that the treatments are equally replicated. If n is the 
common number of replicates per treatment, we have ȳᵢ = Tᵢ/n.) The expression Σaȳ = (a₁ȳ₁ + . . . + aₜȳₜ) is 
called a linear combination of the treatment means. A linear combination is called a comparison or a contrast 
if the coefficients (the a’s) add up to zero. For example, if we have t = 4 treatments, ȳ₁ − (ȳ₂ + ȳ₃ + ȳ₄) is a 
linear combination of the treatment means. It is not a contrast, however, since the sum of coefficients is 
nonzero. (It is equal to −2.) This linear combination compares the mean of the first treatment with the sum of 
the means of the remaining three treatments, which is not a fair comparison according to the ordinary 
meaning of “fair.” A fair comparison is to compare ȳ₁ with the average of the means of the remaining three 
treatments, given by ȳ₁ − (ȳ₂ + ȳ₃ + ȳ₄)/3, which is now also a contrast since the coefficients add up to zero. To 
avoid fractional coefficients, the preceding contrast usually is written 3ȳ₁ − (ȳ₂ + ȳ₃ + ȳ₄). 


The sum of squares corresponding to a contrast C = Σaȳ is 

$$\text{s.s.}(C) = \frac{n\left(\sum a\bar{y}\right)^{2}}{\sum a^{2}} = \frac{\left(\sum aT\right)^{2}}{n\sum a^{2}}, \tag{2.1}$$

where Σa² is the sum of the squares of the coefficients in the contrast. (Notice that the s.s. is unchanged if we 
multiply the coefficients by a constant.) Since a contrast has one d.f., the s.s. is also a mean square (m.s.) 
because (m.s.) = (s.s.)/(d.f.). It may be tested for significance by dividing it by the error m.s. (with m d.f., 
say) that normally would be used to make the overall test for treatments in the analysis of variance. The 
calculated ratio is compared with the critical value of the F-distribution with 1 and m d.f. 
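
Equation (2.1) can be sketched in a few lines. The treatment totals, the error mean square, and the function name below are hypothetical and serve only to show the arithmetic.

```python
def contrast_ss(coeffs, totals, n):
    """Sum of squares for a contrast from treatment totals, as in Equation (2.1)."""
    if sum(coeffs) != 0:
        raise ValueError("coefficients of a contrast must add up to zero")
    c_total = sum(a * T for a, T in zip(coeffs, totals))
    return c_total ** 2 / (n * sum(a * a for a in coeffs))


# Hypothetical data: t = 4 treatments, n = 3 replicates each.
totals = [30.0, 42.0, 45.0, 39.0]  # treatment totals T1..T4
n = 3
error_ms = 4.0                     # hypothetical error mean square

ss = contrast_ss([3, -1, -1, -1], totals, n)         # treatment 1 vs. the rest
ss_scaled = contrast_ss([6, -2, -2, -2], totals, n)  # same contrast, doubled
f_ratio = ss / error_ms                              # F with 1 and m d.f.
print(ss, ss_scaled, f_ratio)
```

Doubling every coefficient leaves the sum of squares unchanged, as the text notes, and the ratio of the contrast s.s. to the error m.s. is the quantity referred to the F-distribution with 1 and m d.f.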

If we are comparing t = 4 treatments in a completely randomized experiment with n = 3 replicates per 
treatment, the d.f. for the error m.s. is m = t(n − 1) = 8. In a 5% two-tail test, the critical value of the 
F-distribution with 1 and 8 d.f. is 5.32. If a one-tail test is justifiable (as, for example, if in the contrast 3ȳ₁ − 
(ȳ₂ + ȳ₃ + ȳ₄), the first treatment is control and the other treatments are three types of insecticides), the 5% 
critical value is only 3.46. Since a smaller critical value is easier to exceed, a significant difference is easier to 
declare in a one-tail test. Consequently, the test is less likely to commit a Type II error (failure to declare a 
difference when one exists). 

Two contrasts, C₁ = Σaȳ and C₂ = Σbȳ, are said to be orthogonal if Σab = 0 (i.e., if the sum of the 
products of the corresponding coefficients in the two contrasts is zero). A set of contrasts is said to be 
mutually orthogonal if all pairs of contrasts in the set are orthogonal. If, for brevity, we write (a₁ȳ₁ + a₂ȳ₂ + 
. . . + aₜȳₜ) as (a₁, a₂, . . ., aₜ), the three contrasts (1, 1, −1, −1), (1, −1, −1, 1), and (1, −1, 1, −1) are 
mutually orthogonal. It can be proved that a set of mutually orthogonal contrasts among t means contains at 
most (t − 1) contrasts; however, there are infinitely many such sets of mutually orthogonal contrasts. 

The following are another two sets of mutually orthogonal contrasts: (1, 1, −1, −1), (1, −1, 0, 0), (0, 0, 1, 
−1), and (3, −1, −1, −1), (0, 2, −1, −1), (0, 0, 1, −1). It also can be proved that if C₁, C₂, . . ., Cₜ₋₁ are (t − 1) 
mutually orthogonal contrasts, their individual sums of squares add up exactly to the treatments s.s. The 
statistical distributions of these contrasts are independent. This is one reason why, whenever possible, we 
should aim for an orthogonal decomposition of the treatments d.f. Of the possible sets of mutually orthogonal 
contrasts, the experimenter should choose the set that is most interesting or most relevant to his study. 
Mutual orthogonality is desirable but not absolutely essential. If several contrasts interest the scientist, he 
should not let the lack of mutual orthogonality prevent him from performing the statistical tests, as long as 
these contrasts have not been suggested by the data. Contrasts suggested after data snooping should be 
tested by a multiple comparison procedure. 
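
The additivity of the sums of squares, and the way an overall F-test can be diluted by nonsignificant contrasts, can be checked on a small made-up example. The totals and the error mean square are hypothetical, chosen so that only the first treatment differs from the rest.

```python
def contrast_ss(coeffs, totals, n):
    """Single-d.f. sum of squares (sum aT)^2 / (n * sum a^2)."""
    c_total = sum(a * T for a, T in zip(coeffs, totals))
    return c_total ** 2 / (n * sum(a * a for a in coeffs))


# Hypothetical totals for t = 4 treatments with n = 3 replicates each;
# only treatment 1 differs from the others.
totals = [30.0, 42.0, 42.0, 42.0]
n = 3
error_ms = 4.0  # hypothetical error mean square (8 d.f. in this layout)

contrasts = [(3, -1, -1, -1), (0, 2, -1, -1), (0, 0, 1, -1)]

# Treatments s.s. computed directly from the totals.
t = len(totals)
treatment_ss = sum(T * T for T in totals) / n - sum(totals) ** 2 / (n * t)

parts = [contrast_ss(c, totals, n) for c in contrasts]
single_f = [ss / error_ms for ss in parts]
overall_f = (treatment_ss / (t - 1)) / error_ms
print(parts, treatment_ss)
print(single_f, overall_f)
```

Here the three contrast sums of squares (36, 0, 0) add up to the treatments s.s., and the overall F of 3.0 is the average of the single-d.f. ratios 9.0, 0, and 0. With 8 error d.f., the 5% critical values are about 5.32 for 1 d.f. and 4.07 for 3 d.f., so the first contrast is significant while the overall test is not.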


2.2. Qualitative Factors 


Experimental variables or factors may be divided into qualitative and quantitative factors. Examples of 
qualitative factors are varieties (peanuts, corn, etc.), types (soils, fungicides, etc.), locations, and methods of 
chemical analyses or of counting bacteria. Examples of quantitative factors are temperature, pressure, 
humidity, pH, concentration, and several levels of a fertilizer. Although the various varieties or soil types in 
an experiment also are referred to as the levels of the factors “varieties” and “soil types,” no meaningful 
numerical values can be assigned to the levels of a qualitative factor. Levels of a quantitative variable are, of 
course, naturally numerical. 

Factorial experiments are those in which the treatments are made up of all possible combinations of the 
levels of two or more factors (qualitative or quantitative). (The term “factorial” thus merely describes the 
nature of the treatments and not the design of the experiment, which may be completely randomized, 
randomized block, Latin square, split-plot, etc.) The simplest factorial is the 2² or 2 × 2 experiment, with two 
factors A and B, each at two levels. For the 2 × 2 factorial, the partitioning of the d.f. for treatments is the 
same whether the two factors are both qualitative or quantitative, or one of each kind. The two levels may be 
designated generally as H (high) or L (low). The low level, in particular, may be zero. For a qualitative factor, 
we may arbitrarily label one level H and the other L. The four treatments are denoted by (1), a, b, and ab, 
where absence of a letter implies that the corresponding factor is at the low level; and (1) is a special symbol 
for the treatment where both factors are at the low level. These four treatments could have been more 
explicitly but awkwardly denoted by A₁B₁, A₂B₁, A₁B₂, and A₂B₂, respectively. 

The three d.f. for treatments are partitioned into the main effect of A, main effect of B, and their 
interaction. The coefficients for these contrasts are as follows: 


                         Treatments 
Contrasts      (1)      a       b       ab 
   C₁          −1      +1      −1      +1      Main effect of A 
   C₂          −1      −1      +1      +1      Main effect of B 
   C₃          +1      −1      −1      +1      Interaction of A and B 


The coefficients for the main effect of A are +1 for treatments where A is at the high level and −1 if A is at the 
low level; and similarly for B. The coefficients for interaction are obtained by multiplying corresponding 
coefficients for main effects. To get the sums of squares for the preceding contrasts, we apply Equation (2.1) 
to the four treatment means or totals, using the coefficients for each contrast in turn. 

The difference [a − (1)] is called the simple effect of A at the low level of B; similarly, (ab − b) is the simple 
effect of A at the high level of B. The main effect of A is the average of the simple effects of A. (To avoid 
fractions, the coefficients for this average have been multiplied by two. The reader will recall that s.s. for a 
contrast is unchanged if the coefficients are multiplied by a common number.) 

If the factors A and B act independently, the two simple effects of A should be about the same. 
(Experimental or random errors will prevent them from being exactly equal.) Therefore, their difference 

$$(ab - b) - [a - (1)] = (1) - a - b + ab = C_3$$

should be approximately zero if A and B are independent. If this quantity is large (significantly different from 
zero), we say that there is interaction between A and B (i.e., effect of A at low level of B is different from effect 
of A at high level of B). We also can write C₃ as (ab − a) − [b − (1)] = (effect of B at high level of A) − (effect of 
B at low level of A) so that if effect of A depends on the level of B, we know that the effect of B depends on the 
level of A. 

The following artificial two-way tables of means show some possible results of the tests for main effects 
and interaction. In (d), for example, the simple effect of A is 10 units at low B and 20 units at high B, showing 
dependence of the effect of A on the level of B or interaction between A and B. 


                       A 
               Low    High    Average 
     Low        10      20      15 
B    High       12      24      18 
     Average    11      22 
(a) A sig.; B not sig.; A × B not sig. 

                       A 
               Low    High    Average 
     Low        10      20      15 
B    High       22      34      28 
     Average    16      27 
(b) A sig.; B sig.; A × B not sig. 

                       A 
               Low    High    Average 
     Low        10      20      15 
B    High        6      26      16 
     Average     8      23 
(c) A sig.; B not sig.; A × B sig. 

                       A 
               Low    High    Average 
     Low        10      20      15 
B    High       18      38      28 
     Average    14      29 
(d) A sig.; B sig.; A × B sig. 
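
The simple effects, main effects, and interaction contrast for the four artificial tables can be computed directly from the cell means. The function name is illustrative only; whether an effect is declared significant still depends on the error mean square, which the tables do not give.

```python
def effects_2x2(one, a, b, ab):
    """Simple effects, main effects, and interaction from the four cell means.

    one, a, b, ab are the means of treatments (1), a, b, and ab.
    Main effects are reported as averages of the simple effects.
    """
    a_at_low_b = a - one
    a_at_high_b = ab - b
    b_at_low_a = b - one
    b_at_high_a = ab - a
    return {
        "A at low B": a_at_low_b,
        "A at high B": a_at_high_b,
        "main A": (a_at_low_b + a_at_high_b) / 2,
        "main B": (b_at_low_a + b_at_high_a) / 2,
        "interaction": a_at_high_b - a_at_low_b,
    }


# Cell means of the four artificial tables, in the order (1), a, b, ab.
tables = {
    "a": (10, 20, 12, 24),
    "b": (10, 20, 22, 34),
    "c": (10, 20, 6, 26),
    "d": (10, 20, 18, 38),
}
results = {name: effects_2x2(*cells) for name, cells in tables.items()}
for name, eff in results.items():
    print(name, eff)
```

In tables (a) and (b) the two simple effects of A differ by only 2 units, whereas in (c) and (d) they differ by 10 units, which is the pattern labeled as interaction.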


In general, a two-factor experiment is a p × q factorial. The (pq − 1) d.f. for treatments will be 
partitioned into main effects of A with (p − 1) d.f., main effects of B with (q − 1) d.f., and interaction with 
(p − 1)(q − 1) d.f. The A × B interaction is more difficult to illustrate if p and q are greater than two, but the 
interpretation is similar to that in the 2 × 2 factorial; viz., differences among levels of A depend on the levels of 
B, and vice versa. If the p levels of A are such that orthogonal contrasts are possible, the (p − 1) d.f. for the 
main effects of A should be partitioned further into single d.f. If it is impossible to partition the (p − 1) d.f. for 
A, then it is legitimate to use a multiple comparison procedure to compare the p levels of A. 

Testing the main effects of A presupposes that there is no A × B interaction. If interaction exists, the 
differences among the levels of A depend on the level of B. It does not make much sense to compare the levels 
of A averaged over all levels of B, which is what a main effect is. It is more instructive to compare the levels of A 
for each level of B separately, and vice versa, using the pooled error mean square from the complete 
experiment, if the assumption of homogeneous variances is valid. 

With three factors, the simplest is a 2³ or 2 × 2 × 2 factorial. The eight treatments may be denoted by (1), 
a, b, ab, c, ac, bc, abc, in an obvious extension of the previous notation, where, for example, ac stands for the 
treatment with factors A and C at their high level and B at the low level. The seven d.f. for treatments will be 
partitioned into main effects (A, B, C), two-factor (or first order) interactions (A × B, A × C, B × C), and 
three-factor (or second order) interaction (A × B × C), each with a single d.f. Second and higher order 
interactions are difficult to interpret. The A × B × C interaction is the interaction of (A × B) and C. If the A × B 
× C interaction is significant, the A × B interaction at the high level of C is different from that at the low level 
of C. The coefficients for the following contrasts are obtained as in the 2 × 2 factorial experiment. 


                                 Treatments 
Contrasts     (1)     a      b      ab     c      ac     bc     abc 
A             −1     +1     −1     +1     −1     +1     −1     +1 
B             −1     −1     +1     +1     −1     −1     +1     +1 
A × B         +1     −1     −1     +1     +1     −1     −1     +1 
C             −1     −1     −1     −1     +1     +1     +1     +1 
A × C         +1     −1     +1     −1     −1     +1     −1     +1 
B × C         +1     +1     −1     −1     −1     −1     +1     +1 
A × B × C     −1     +1     +1     −1     +1     −1     −1     +1 

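
The rule that interaction coefficients are products of the corresponding main-effect coefficients can be used to generate the whole table. The following is a minimal sketch; the variable names and the "AxB" style of labeling are choices made here, not the author's notation.

```python
from itertools import combinations

factors = "ABC"

# Standard order (1), a, b, ab, c, ac, bc, abc: bit i of the index k says
# whether factor i is at its high level.
labels = []
main = {f: [] for f in factors}
for k in range(2 ** len(factors)):
    high = [f for i, f in enumerate(factors) if (k >> i) & 1]
    labels.append("".join(high).lower() or "(1)")
    for f in factors:
        main[f].append(1 if f in high else -1)

# Interaction coefficients are products of the main-effect coefficients.
table = dict(main)
for size in (2, 3):
    for combo in combinations(factors, size):
        coeffs = [1] * len(labels)
        for f in combo:
            coeffs = [c * m for c, m in zip(coeffs, main[f])]
        table["x".join(combo)] = coeffs

print(labels)
for name, coeffs in table.items():
    print(f"{name:>6}", coeffs)
```

The seven rows produced this way are mutually orthogonal, so they account for the seven d.f. for treatments one at a time.
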
The 2 × 2 × 2 factorial can be generalized to the p × q × r factorial (three factors A, B, and C, with p, q, 
and r levels, respectively), to the 2ᵖ factorial (p factors, each at two levels), and to the p₁ × p₂ × . . . × pᵣ 
factorial (r factors with p₁, p₂, . . ., pᵣ levels). The total number of treatment combinations increases rapidly with 
increasing number of factors. With six factors, even if each is at two levels, we require 2⁶ = 64 experimental 
units per replicate. Besides the 6 main effects, there will be 15 two-factor, 20 three-factor, 15 four-factor, 6 
five-factor, and 1 six-factor interactions. If we can assume that high order interactions (four-factor or higher, 
say) do not exist, as is usually true, we may pool these interactions for use as error mean square so that we do 
not need to replicate. In fact, a single replicate already may be too large an experiment, and our resources 
may allow us to carry out only a portion of the full factorial experiment. So-called fractional factorial 
experiments are available for this purpose. They are discussed in Davies (1956), Cochran and Cox (1957), 
Peng (1967), John (1971), and Anderson and McLean (1974). 
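
The counts quoted for six two-level factors are binomial coefficients, which a short sketch can confirm. The variable names are illustrative, and the pooled d.f. assumes, as in the text, that four-factor and higher interactions are absent.

```python
from math import comb

k = 6  # number of two-level factors
units = 2 ** k  # experimental units per replicate

# Number of effects involving exactly `order` factors (order 1 = main effects).
effects = {order: comb(k, order) for order in range(1, k + 1)}

# d.f. available for error if four-factor and higher interactions are pooled.
high_order_df = sum(effects[order] for order in range(4, k + 1))
print(units, effects, high_order_df)
```

The effects together use the 63 treatment d.f., and pooling the four-, five-, and six-factor interactions would supply 22 d.f. for the error mean square from a single replicate.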

The following example, taken from Little and Hills (1972), shows the partitioning of treatments d.f. to 
give meaningful single d.f. contrasts. Six nitrogen treatments were compared for their effect on the yield of 
sugar beets: control (1), urea (2), ammonium sulfate (3), ammonium nitrate (4), calcium nitrate (5), and sodium 
nitrate (6). 


                        Treatments 
Contrasts     1     2     3     4     5     6 
C₁           −5    +1    +1    +1    +1    +1     Nitrogen vs. no nitrogen 
C₂            0    −4    +1    +1    +1    +1     Organic vs. inorganic nitrogen 
C₃            0     0    −1    −1    +1    +1     Ammonium vs. nitrate nitrogen 
C₄            0     0    −1    +1     0     0     Ammonium nitrate vs. sulfate 
C₅            0     0     0     0    −1    +1     Calcium vs. sodium nitrate 


The reader should check the mutual orthogonality of the contrasts. Note that the interpretation of Contrast 
C₃ is not quite right since Treatment 4 contains both ammonium and nitrate nitrogen. 
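
The orthogonality check suggested above can be sketched as follows. The coefficients are those read from the table of nitrogen contrasts; the dictionary keys and variable names are illustrative only.

```python
from itertools import combinations

# Contrast coefficients over treatments 1..6 (control, urea, ammonium sulfate,
# ammonium nitrate, calcium nitrate, sodium nitrate), as read from the table.
contrasts = {
    "C1": (-5, 1, 1, 1, 1, 1),
    "C2": (0, -4, 1, 1, 1, 1),
    "C3": (0, 0, -1, -1, 1, 1),
    "C4": (0, 0, -1, 1, 0, 0),
    "C5": (0, 0, 0, 0, -1, 1),
}

# Each row must add up to zero to be a contrast.
sums_to_zero = all(sum(c) == 0 for c in contrasts.values())

# Sum of products of corresponding coefficients for every pair of contrasts.
cross_products = {
    (p, q): sum(a * b for a, b in zip(contrasts[p], contrasts[q]))
    for p, q in combinations(contrasts, 2)
}
mutually_orthogonal = all(v == 0 for v in cross_products.values())
print(sums_to_zero, mutually_orthogonal)
```

All ten pairwise sums of products are zero, so the five contrasts form a complete orthogonal partition of the five d.f. for treatments.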

An interesting factorial experiment was conducted by Dr. Ralph Segall at the U.S. Horticultural 
Research Laboratory in Orlando, Fla. He studied the effects of 10 fertilizer treatments on the incidence of 
postharvest bacterial soft-rot of tomato fruits. The 10 treatments (all of which had 180-25) initially may be 
regarded as a 2 × 5 factorial (mulching at two levels and “additives” at five levels). The five additives are 
made up of control and four chemicals. The four chemicals are in the form of a 2 × 2 factorial (2 anions and 2 
cations). We have shown the coefficients for only five mutually orthogonal contrasts. The remaining four 
contrasts are the interactions between C₁ and each of C₂, C₃, C₄, and C₅. The reader may interpret the 
contrasts C₁, . . ., C₅ and the interactions between C₁ and each of C₂, . . ., C₅. 

                                                       Contrasts 
                    Treatments                  C₁    C₂    C₃    C₄    C₅ 
                    Control (1)                 +1    −4     0     0     0 
                    Calcium nitrate (2)         +1    +1    +1    +1    +1 
Mulched beds        Calcium chloride (3)        +1    +1    +1    −1    −1 
                    Potassium nitrate (4)       +1    +1    −1    +1    −1 
                    Potassium chloride (5)      +1    +1    −1    −1    +1 
                    Control (6)                 −1    −4     0     0     0 
                    Calcium nitrate (7)         −1    +1    +1    +1    +1 
Nonmulched beds     Calcium chloride (8)        −1    +1    +1    −1    −1 
                    Potassium nitrate (9)       −1    +1    −1    +1    −1 
                    Potassium chloride (10)     −1    +1    −1    −1    +1 
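
The four interaction contrasts that are not tabulated can be formed by multiplying the coefficients of C₁ by those of each of the other contrasts. The sketch below assumes the factorial pattern of the table (the same additive coefficients in mulched and nonmulched beds); the "C1xC2" style of naming is a choice made here.

```python
# Coefficients of C2..C5 for the five additives (control, calcium nitrate,
# calcium chloride, potassium nitrate, potassium chloride).
additive = {
    "C2": (-4, 1, 1, 1, 1),
    "C3": (0, 1, 1, -1, -1),
    "C4": (0, 1, -1, 1, -1),
    "C5": (0, 1, -1, -1, 1),
}
mulch = (1, -1)  # C1: mulched vs. nonmulched beds

# Treatments 1..5 are mulched, 6..10 nonmulched.
contrasts = {"C1": [m for m in mulch for _ in range(5)]}
for name, coeffs in additive.items():
    contrasts[name] = list(coeffs) * 2  # same coefficients in both kinds of bed
    contrasts["C1x" + name] = [m * c for m in mulch for c in coeffs]

names = list(contrasts)
orthogonal = all(
    sum(x * y for x, y in zip(contrasts[p], contrasts[q])) == 0
    for i, p in enumerate(names)
    for q in names[i + 1:]
)
print(len(names), orthogonal)
print(contrasts["C1xC2"])
```

The nine contrasts obtained this way are mutually orthogonal and account for the nine d.f. among the 10 treatments.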


There may be situations in which it is justifiable to apply a multiple comparison procedure to compare 
factorial treatments. For example, suppose a farmer is interested in growing one of three types of grasses and 
using one of four types of fertilizers. The farmer is not interested in the scientific comparison of yields from 
the three varieties of grasses or types of fertilizers. He is only interested in maximizing his profit. If the 
commercial values of the three grasses and the costs of the four fertilizers are different, analyzing the profit 
(in dollars and cents) per plot is more relevant than analyzing yields per plot. The 12 treatments 
(combinations of grasses and fertilizers) may be compared for profitability, using a multiple comparison procedure and 
ignoring their factorial nature. 

At a panel discussion sponsored by the Data Systems Application Division, Agricultural Research 
Service, during the joint meeting of the statistical societies in Atlanta in August 1975, two panel members 
(Dr. David B. Duncan and Dr. John W. Tukey) said they might condone multiple comparisons of individual 
factorial treatments (from qualitative factors) if the main effects were not significant (Duncan) or if their F 
ratios were less than two (Tukey). 


2.3. Quantitative Factors 


2.3.1. One Factor 


With a quantitative factor (e.g., temperature, pressure, humidity, pH, and concentration or levels of a 
fertilizer), regression analysis or curve fitting is the most appropriate technique. The treatments d.f. and s.s. 
should be partitioned into components due to linear (first degree) regression, quadratic (second degree) 
regression, cubic (third degree) regression, and so forth. If enough theoretical knowledge exists to specify 
the mathematical form of the relationship between the response y and the experimental variable x (e.g., 
logistic, Mitscherlich’s law, Gompertz’s law, von Bertalanffy’s curve, etc.), this equation should be fitted to 
the data. In most (if not all) agricultural experimentations, however, the mathematical relationship between 
the response and the so-called independent variable is so complex that it defies specification. Therefore, we 
must approximate the unknown mathematical relationship by means of a polynomial of the form 
$y = b_0 + b_1x + b_2x^2 + \dots + b_qx^q$. Within a limited range of the independent variable, a polynomial 
approximation is usually satisfactory, unless the response levels off within the experimental range of x, in 
which case an asymptotic curve should be fitted. 

Table 1 shows the analysis of variance of a randomized block experiment with b replicates or blocks, t 
treatments (levels of a quantitative factor), and m measurements per plot (experimental unit), with 
partitioning of the treatments d.f. and s.s. into linear and quadratic components. With the general availability of 
computer programs, it is not difficult to fit a polynomial of a higher degree than quadratic. The ratio 
ms(dr)/ms(e) provides a test for the statistical significance of the combined contributions from the higher 
order polynomials, sometimes called a test of the lack of fit of the fitted model (in this case quadratic). If 
quadratic is sufficient, this ratio has the F-distribution with (t − 3) and (b − 1)(t − 1) d.f. (For testing, the 
author generally recommends the use of ms(e) rather than ms(s) as the error term since the latter does not 
represent true replications. If b = 1, we are forced to use ms(s) as the error term, but this is dangerous since 
ms(s) may seriously underestimate ms(e) and it will then be easy to get a spuriously significant result.) 

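
The partition of the treatments s.s. into linear, quadratic, and remaining components can be sketched with orthogonal polynomial coefficients, which apply when the levels are equally spaced and equally replicated. The totals below are hypothetical, and r denotes the number of observations in each treatment total (an assumption standing in for the b blocks times m measurements of the text).

```python
def contrast_ss(coeffs, totals, r):
    """Single-d.f. sum of squares (sum aT)^2 / (r * sum a^2) from totals of r observations."""
    c_total = sum(a * T for a, T in zip(coeffs, totals))
    return c_total ** 2 / (r * sum(a * a for a in coeffs))


# Orthogonal polynomial coefficients for 4 equally spaced levels
# (for example 10, 20, 30, and 40 p/m).
LINEAR = (-3, -1, 1, 3)
QUADRATIC = (1, -1, -1, 1)

# Hypothetical treatment totals, each the sum of r = 5 observations.
totals = [50.0, 70.0, 82.0, 86.0]
r = 5
t = len(totals)

treatment_ss = sum(T * T for T in totals) / r - sum(totals) ** 2 / (r * t)
ss_linear = contrast_ss(LINEAR, totals, r)
ss_quadratic = contrast_ss(QUADRATIC, totals, r)
ss_deviations = treatment_ss - ss_linear - ss_quadratic  # (t - 3) d.f.
print(treatment_ss, ss_linear, ss_quadratic, ss_deviations)
```

In this made-up case the linear and quadratic components account for essentially all of the treatments s.s., so the deviations from the quadratic are zero apart from rounding error. Each component would be divided by its d.f. and tested against ms(e).
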
If the quadratic term is statistically significant but its s.s. is only a small part of the treatments s.s., we 
may prefer to fit a linear trend only since the curvature of the response curve is only slight. Provided the 
curvature is slight, we may be able to predict the response y better (i.e., with a smaller mean squared error of 
prediction) by using a straight line rather than a quadratic, even if the true response curve is a quadratic 
function. This comes about through having to estimate fewer parameters (constants of the response 
function) in linear regression. A straight line is also easier to use than a parabolic curve. 

In comparing the effects of, say, 10, 20, 30, and 40 p/m of a certain chemical, if the linear or quadratic 
regression of response on concentration is significant, or both are significant, no multiple comparison 
procedure is necessary. All concentrations are significantly different in their effects. In fact, even 10 and 10.1 
p/m also will be different. Of course, the difference between the effects of 10 and 10.1 p/m will be extremely 
small. The usual significance test is not concerned with the magnitude of the difference, however. It is only 
concerned about whether a true difference exists, no matter how small. 

We have the following possible results with one factor: 


[Figure: four panels, each plotting the response y against x = 10, 20, 30, 40 p/m.]

(a) LR (NS), QR (NS)        (b) LR (S), QR (NS)
(c) LR (S), QR (S)          (d) LR (NS), QR (S)

LR = linear regression; QR = quadratic regression; S = significant; NS = not significant


In (a), all treatments (infinitely many between 10 and 40 p/m) are the same, while in (b) and (c) all treatments 
are different. In (d), all treatments less than x* (the value of x that will maximize y) are different. We may 
want to estimate x* and construct confidence limits for it. If y* is the maximum response, we may be 
interested in finding the range of x that will give a response higher than (y* — A), where (y* — A) is an 
acceptably high yield. If it costs more to apply the factor x the higher its level is, we should take z as the
response variable, where z is the yield per unit cost of application of x. These considerations are more
meaningful than the question often asked by the naive experimenter: Among 10, 20, 30, and 40 p/m, which are
different in their effects?

Table 1. Analysis of variance of a randomized block experiment to compare effects of several levels of a
quantitative factor

Sources of variation             d.f.           s.s.      m.s.      F
Blocks (B)                       b-1            ss(b)     ms(b)     ms(b)/ms(e)
Treatments (T)                   t-1            ss(t)     ms(t)     ms(t)/ms(e)
  Linear regression              1              ss(lr)    ms(lr)    ms(lr)/ms(e)
  Quadratic reg. (additional)    1              ss(qr)    ms(qr)    ms(qr)/ms(e)
  Deviations from reg.           t-3            ss(dr)    ms(dr)    ms(dr)/ms(e)
Error (B x T)                    (b-1)(t-1)     ss(e)     ms(e)
Subsampling error                bt(m-1)        ss(s)     ms(s)
Total                            btm-1          ss(T)

There are two options if the lowest level of x in the experiment is zero (control). We may fit a regression 
curve to all levels (including zero), or we may isolate a single d.f. for the contrast between zero and nonzero 
levels and fit a regression curve to the nonzero levels only. Quite often the regression is curvilinear in the first 
option and linear in the second option. If this is so, the second method of analysis is preferable, especially if in 
actual usage the factor x will not be applied at a level below the first nonzero level of the experiment. 

For the linear regression model $y = b_0 + b_1 x$, the estimated responses at x = x* and at x = x** are
$y^* = b_0 + b_1 x^*$ and $y^{**} = b_0 + b_1 x^{**}$, respectively. Therefore, the estimated difference in response
at any two values x* and x** is equal to $b_1(x^{**} - x^*)$, and the variance of this estimated or predicted
difference is $(x^{**} - x^*)^2$ (variance of $b_1$). The formula for the variance of $b_1$ is given in Equation (2.4).
The 100(1 - α)% confidence interval for the true difference is
$b_1(x^{**} - x^*) \pm t(\alpha;\nu)\sqrt{(x^{**} - x^*)^2\,(\text{estimated variance of } b_1)}$, where $t(\alpha;\nu)$ is
the two-sided (100α)% point of Student's t-distribution with ν d.f.

For the quadratic regression model $y = b_0 + b_1 x + b_2 x^2$, the estimated difference is
$b_1(x^{**} - x^*) + b_2(x^{**2} - x^{*2})$, with variance equal to
$(x^{**} - x^*)^2\,\text{var}(b_1) + (x^{**2} - x^{*2})^2\,\text{var}(b_2) + 2(x^{**} - x^*)(x^{**2} - x^{*2})\,\text{cov}(b_1, b_2)$.
In a good regression computer program, the printout will include the estimated variances and covariances of
the estimated regression coefficients.

Because linear relationships occur frequently, we will give the computational results for linear regression
analysis. In general, let $\bar{y}_i$ be the mean of the $n_i$ observations taken at $x_i$, the i-th level of the factor
(i = 1, 2, ..., t). (We are allowing unequal replications here. In Table 1, $n_i = bm$, a constant.) The equation
of the fitted line is $y = b_0 + b_1 x$, where

$b_1 = \dfrac{\sum n_i x_i \bar{y}_i - (\sum n_i x_i)(\sum n_i \bar{y}_i)/N}{\sum n_i x_i^2 - (\sum n_i x_i)^2/N},$     (2.2)

$b_0 = \left[\sum n_i \bar{y}_i - b_1 \sum n_i x_i\right]/N,$     (2.3)

and $N = n_1 + n_2 + \dots + n_t$, the total number of observations. (In the simplest linear regression problem,
$n_1 = n_2 = \dots = n_t = 1$, and the above formulas for the slope and intercept of the line will reduce to more
familiar ones.) The s.s. for linear regression is (Num.)^2/Den., where "Num." and "Den." are the numerator and
denominator, respectively, of the expression for $b_1$ above. The s.s. for deviations from regression, now with
(t - 2) d.f. if we are only fitting a straight line, is most conveniently obtained by subtracting ss(lr) from
ss(t), the treatments s.s. Finally, the variance of $b_1$ is

$\text{var}(b_1) = \sigma^2 / \left[\sum n_i x_i^2 - (\sum n_i x_i)^2/N\right],$     (2.4)

and $\sigma^2$ may be estimated by ms(e) in Table 1, or by ms(dr) if b = 1.
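
Equations (2.2) and (2.3), together with the s.s. for linear regression, can be sketched in a few lines of Python. This is only an illustration: the function name `fit_line_from_means` and the three made-up means (chosen to lie exactly on y = 2 + 3x, with unequal replication) are assumptions and do not come from the text.

```python
def fit_line_from_means(x, ybar, n):
    """Straight-line fit from treatment means, following Equations (2.2)-(2.3).

    x    : levels of the quantitative factor
    ybar : mean response at each level
    n    : number of observations behind each mean
    Returns (b0, b1, ss_linear, den); den is the denominator of (2.2),
    which also appears in Equation (2.4) for var(b1).
    """
    N = sum(n)
    sx = sum(ni * xi for ni, xi in zip(n, x))
    sy = sum(ni * yi for ni, yi in zip(n, ybar))
    sxy = sum(ni * xi * yi for ni, xi, yi in zip(n, x, ybar))
    sxx = sum(ni * xi * xi for ni, xi in zip(n, x))

    num = sxy - sx * sy / N        # numerator of (2.2)
    den = sxx - sx ** 2 / N        # denominator of (2.2)
    b1 = num / den                 # slope, Equation (2.2)
    b0 = (sy - b1 * sx) / N        # intercept, Equation (2.3)
    ss_linear = num ** 2 / den     # s.s. for linear regression, (Num.)^2/Den.
    return b0, b1, ss_linear, den


# Made-up illustration: the means lie exactly on y = 2 + 3x.
x = [1.0, 2.0, 4.0]
ybar = [5.0, 8.0, 14.0]
n = [3, 5, 2]
b0, b1, ss_linear, den = fit_line_from_means(x, ybar, n)
print(b0, b1, ss_linear)
```

Because the made-up means fall exactly on a line, the s.s. for linear regression here equals the whole treatments s.s., leaving nothing for deviations from regression.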

If the levels are replicated equally and spaced equally, the computations for obtaining the various s.s. for 
regression will be simplified considerably by the use of orthogonal polynomials, shown in Table 2 for 3, 4, and 
5 levels only. For more extensive tables and discussion of the method for getting the actual regression 
equation, see Fisher and Yates (1963). If we look at t = 4 levels, say, in Table 2, we see that the three sets of
coefficients form a set of mutually orthogonal contrasts. (A polynomial curve of degree (t - 1) will pass
through the t means exactly.) With these coefficients, we can obtain the s.s. for linear or quadratic 
regression, using Equation (2.1) in the previous section on orthogonal contrasts. An example follows. 


Table 2. Orthogonal polynomials
(t = number of levels; d = degree of polynomial)

    t = 3             t = 4                    t = 5
 d=1   d=2      d=1   d=2   d=3      d=1   d=2   d=3   d=4
 -1    +1       -3    +1    -1       -2    +2    -1    +1
  0    -2       -1    -1    +3       -1    -1    +2    -4
 +1    +1       +1    -1    -3        0    -2     0    +6
                +3    +1    +1       +1    -1    -2    -4
                                     +2    +2    +1    +1


Chew (1962) discussed published results of an experiment wherein the research worker erroneously 
concluded that there were no treatment differences, through failure to partition the treatments d.f. and s.s. 
Table 3 shows the analysis of variance and treatment means with b = 5 blocks, t = 4 treatments (0, 2, 4, and 6 
degrees of angle), and m =5 repeated measurements on each experimental unit. (The response was the force 
in pounds required to separate a set of electrical connectors at various angles of pull.) The treatment means 
show increasing response with increasing angles. Each treatment mean was an average of $n_i = bm = 25$
observations. From the coefficients in Table 2, the means in Table 3, and Equation (2.1), we have the following
sums of squares for regression:

linear regression = $\dfrac{25[(-3)(41.94) + (-1)(42.36) + (1)(43.82) + (3)(46.30)]^2}{(-3)^2 + (-1)^2 + (1)^2 + (3)^2} = 264.26$

quadratic regression = $\dfrac{25[(1)(41.94) + (-1)(42.36) + (-1)(43.82) + (1)(46.30)]^2}{(1)^2 + (-1)^2 + (-1)^2 + (1)^2} = 26.52$

cubic regression = $\dfrac{25[(-1)(41.94) + (3)(42.36) + (-3)(43.82) + (1)(46.30)]^2}{(-1)^2 + (3)^2 + (-3)^2 + (1)^2} = 0.01$

In a two-tail test, the F-ratio for linear regression is significant at between the 2½% and the 1% level. In a
one-tail test it will be significant at between the 1¼% and the ½% level. (A one-tail test could be justified
here.)
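
As a check on the arithmetic above, the contrast sums of squares can be recomputed from the printed means. The sketch below assumes only the quantities quoted in the text (the four means, n = 25, the Table 2 coefficients for t = 4, and the error mean square of 37.92); the helper name `contrast_ss` is hypothetical. Because the printed means are rounded to two decimals, the cubic s.s. computed this way comes out smaller than the 0.01 shown in Table 3.

```python
# Treatment means at 0, 2, 4, and 6 degrees, each based on n = 25 observations.
means = [41.94, 42.36, 43.82, 46.30]
n = 25
error_ms = 37.92  # ms(e) with 12 d.f., from Table 3

# Orthogonal polynomial coefficients for t = 4 levels (Table 2).
contrasts = {
    "linear": [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic": [-1, 3, -3, 1],
}


def contrast_ss(coeffs, means, n):
    """Sum of squares for one contrast among equally replicated means."""
    value = sum(c * m for c, m in zip(coeffs, means))
    return n * value ** 2 / sum(c * c for c in coeffs)


ss = {name: contrast_ss(c, means, n) for name, c in contrasts.items()}
f_linear = ss["linear"] / error_ms  # F-ratio with 1 and 12 d.f.

for name, value in ss.items():
    print(f"{name:10s} s.s. = {value:.4f}")
print(f"F (linear) = {f_linear:.2f}")
```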


Table 3. Analysis of variance and means

Source of variation        d.f.      s.s.       m.s.      F
Blocks                       4     1234.83     308.71
Treatments                   3      290.79      96.93     2.56 (not sig.)
  Linear regression          1      264.26     264.26     6.97*
  Quadratic regression       1       26.52      26.52     <1
  Cubic regression           1         .01        .01     <1
Error                       12      455.03      37.92
Subsampling error           80      316.50       3.96
Total                       99     2297.15

x:                 0         2         4         6
Mean y:          41.94     42.36     43.82     46.30
Difference:           0.42      1.46      2.48


With $n_i = 25$ and N = 100, the formulas for the slope and intercept give:

$b_1 = \dfrac{(25)[0(41.94) + 2(42.36) + 4(43.82) + 6(46.30)] - (25)(12)(25)(174.42)/100}{(25)(0 + 4 + 16 + 36) - [25(12)]^2/100} = 0.727,$

$b_0 = [25(174.42) - 0.727(25)(12)]/100 = 41.424,$

so that the equation is y = 41.424 + 0.727x.

Since regression is significant, no multiple comparisons are necessary. The treatments are ALL different
(in their effects). For example, 0 and 2 degrees are different (without testing), as well as 0 and 1 degree or
even 0 and 0.1 degree. This equation gives an estimate of y for any given x; and, clearly, for two different
values of x, the equation gives different values of y. The difference in response at x = x* from that at x = x** is

$\hat{y}(x^{**}) - \hat{y}(x^{*}) = 0.727\,(x^{**} - x^{*}),$

and its estimated variance is $(x^{**} - x^*)^2 (37.92)/\{(25)(56) - [25(12)]^2/100\} = 0.0758\,(x^{**} - x^*)^2$, using
Equation (2.4) for the variance of $b_1$. The 95% confidence interval for the difference in the two responses
corresponding to a unit difference in the x values is $0.727 \pm 2.179\sqrt{0.0758} = 0.727 \pm 0.600 = (0.127, 1.327)$.
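
The confidence interval just quoted can be reproduced as follows. This is a sketch under the assumptions stated in the text: the error mean square 37.92 with 12 d.f. estimates the variance, and 2.179 is taken as the two-sided 5% point of t with 12 d.f. The function name `difference_interval` is illustrative only.

```python
import math

n_i = 25                                  # observations per treatment mean
x = [0, 2, 4, 6]                          # angles of pull, in degrees
ybar = [41.94, 42.36, 43.82, 46.30]       # treatment means from Table 3
error_ms = 37.92                          # ms(e), 12 d.f.
t_crit = 2.179                            # two-sided 5% point of t, 12 d.f. (from the text)

N = n_i * len(x)
sum_x = n_i * sum(x)
sum_y = n_i * sum(ybar)
num = n_i * sum(xi * yi for xi, yi in zip(x, ybar)) - sum_x * sum_y / N
den = n_i * sum(xi * xi for xi in x) - sum_x ** 2 / N

b1 = num / den                            # slope, Equation (2.2)
var_b1 = error_ms / den                   # Equation (2.4) with ms(e) for sigma^2


def difference_interval(x_low, x_high):
    """95% confidence interval for the true difference in response."""
    diff = b1 * (x_high - x_low)
    half_width = t_crit * math.sqrt((x_high - x_low) ** 2 * var_b1)
    return diff - half_width, diff + half_width


lower, upper = difference_interval(0, 1)  # unit difference in x
print(f"b1 = {b1:.3f}, var(b1) = {var_b1:.4f}")
print(f"interval = ({lower:.3f}, {upper:.3f})")
```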

If the observed means of the t levels are in increasing (or decreasing) order and t is at least four, no
further statistical test is necessary to establish significance of treatment effects, if it is known a priori that
the effect of treatment, if any, is to increase (or decrease) the response. The probability of the t means falling
in that order under the null hypothesis is 1/(t!), which equals 1/24 if t = 4 and is significant at the
conventional 5% level. If there is no prior knowledge of the direction of the treatment effect, a two-sided test is
necessary and t has to be at least five for the ordering of the t means to be significant at the 5% level.
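
The ordering argument is easy to tabulate. In the sketch below (the function name `ordering_p_value` is hypothetical), the two-sided probability is taken as twice the one-sided value, since either a fully increasing or a fully decreasing order would count as evidence.

```python
from math import factorial


def ordering_p_value(t, two_sided=False):
    """Probability that t means fall in one prescribed order under the null hypothesis."""
    p = 1 / factorial(t)
    return 2 * p if two_sided else p


for t in (3, 4, 5, 6):
    print(t, round(ordering_p_value(t), 4), round(ordering_p_value(t, two_sided=True), 4))
```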

For a criticism of the widespread misuse of Duncan’s multiple range test in agricultural research to 
compare levels of a quantitative factor, see Mead and Pike (1975), particularly Section 2.2. 


2.3.2. Two or More Factors 


For one quantitative factor, we partition the treatments d.f. into linear, quadratic, cubic, etc., regression,
which is equivalent to fitting a polynomial of the form $y = b_0 + b_1 x + b_2 x^2 + \dots + b_d x^d$, where y is the
measured response and x is the level of the experimental factor. We similarly analyze two quantitative factors
A and B. Denote the levels of A and B by $x_1$ and $x_2$, respectively. The following are the first and the second
degree (or order) polynomials in two variables:

$y = b_0 + b_1 x_1 + b_2 x_2$     (first order)
$y = b_0 + (b_1 x_1 + b_2 x_2) + (b_{11} x_1^2 + b_{12} x_1 x_2 + b_{22} x_2^2)$     (second order)

In the second order polynomial, the coefficients $b_{11}$, $b_{12}$, $b_{22}$ could have been replaced by $b_3$, $b_4$, $b_5$.
The double subscript, however, reminds us that these are the coefficients for the quadratic terms. Just as the
second order model is obtained from the first order model by adding the second order (or quadratic) terms, we
similarly obtain the third order model by adding the cubic terms
$(b_{111} x_1^3 + b_{112} x_1^2 x_2 + b_{122} x_1 x_2^2 + b_{222} x_2^3)$ to the second order model.

In partitioning the d.f. in a 2 x 2 factorial, we are in essence fitting the model
$y = b_0 + b_1 x_1 + b_2 x_2 + b_{12} x_1 x_2$, an incomplete second order model. (With only two levels, we cannot
estimate squared terms.)

In a 3 x 3 factorial, the 2 d.f. for each of the two main effects may be further partitioned into linear and
quadratic terms. The 4 d.f. for the A x B interaction may be partitioned into products of the linear and
quadratic terms of the main effects. Therefore, we are fitting the model

$y = b_0 + (b_1 x_1 + b_{11} x_1^2) + (b_2 x_2 + b_{22} x_2^2) + (b_{12} x_1 x_2 + b_{122} x_1 x_2^2 + b_{112} x_1^2 x_2 + b_{1122} x_1^2 x_2^2),$

where the three bracketed groups are, in order, the main effects of A, the main effects of B, and the A x B
interaction. This is a second order model plus two cubic terms and one quartic term.

Table 4 gives the analysis of variance of a randomized block experiment with b blocks and t treatments, 
with the t treatments forming a p x q factorial. This table should be compared with Table 1 for one 
quantitative factor. (If m measurements were made on each experimental unit, we will assume that they have
been averaged; otherwise, there will be an extra line in the analysis of variance, as in Table 1.) The 2 d.f. for
linear regression may be further partitioned to show the individual contributions from $x_1$ and $x_2$ separately.
They are partitioned similarly for quadratic and cubic regressions. The sums of squares in the s.s. column
usually are called the sequential sums of squares. For example, ss(qr) is not the total quadratic regression
s.s.; it is the additional s.s., after fitting a linear model. In other words, ss(qr) is the difference in regression
sums of squares between fitting a linear model and a full quadratic model. If the true model (true state of
nature) is linear, ms(qr), ms(cr), and ms(lof) will be almost the same as ms(e), the error m.s. The quadratic
model has 5 coefficients (other than the intercept $b_0$); therefore, it has 5 d.f. and its s.s. is obtained by adding
ss(lr) and ss(qr). If p = q = 5 (i.e., a 5 x 5 factorial), t = pq = 25 and "lack of fit" has (t - 10) = 15 d.f. If we
are certain that a cubic model is adequate, and this is usually so, we do not need any replication. We can use
ms(lof) as the error m.s. in making tests of significance. With replication, however, we can test the cubic
model. The extension of Table 4 to three or more quantitative factors should be obvious.


Table 4. Analysis of variance of a randomized block experiment
with 2 quantitative factors

Sources of variation             d.f.           s.s.       m.s.
Blocks (B)                       b-1            ss(b)      ms(b)
Treatments (T)                   t-1            ss(t)      -
  Linear regression              2              ss(lr)     ms(lr)
  Quadratic reg. (additional)    3              ss(qr)     ms(qr)
  Cubic reg. (additional)        4              ss(cr)     ms(cr)
  Lack of fit                    t-10           ss(lof)    ms(lof)
Error (B x T)                    (b-1)(t-1)     ss(e)      ms(e)
Total                            bt-1           ss(T)


Since getting the various s.s. is extremely tedious on a desk calculator, a computer is necessary. If the
levels of A and B are equally replicated and equally spaced (e.g., 5, 10, and 15 units for A and 100, 200, and 300
p/m for B), we can use orthogonal polynomials, as in the one-factor case. We illustrate this with a 3 x 3
factorial. From Section 2.3.1, we know how to obtain the linear and quadratic regression s.s. for A and for B,
using either the means or the sums for the levels of A and of B. Table 5 gives the coefficients for getting the
s.s. corresponding to $x_1 x_2$, $x_1^2 x_2$, $x_1 x_2^2$, and $x_1^2 x_2^2$. The coefficients will operate on the treatment
means as usual. For example, if we denote the treatment means by $\bar{y}_1, \dots, \bar{y}_9$ in the order shown in Table 5,
the s.s. corresponding to $x_1 x_2$ (or $A_L \times B_L$) is, from Equation (2.1) in Section 2.1, equal to
$b(\bar{y}_1 - \bar{y}_3 - \bar{y}_7 + \bar{y}_9)^2/4$, where b is the number of observations in each mean. We also can use the
coefficients in Table 5 to get the s.s. for $A_L$, $A_Q$, $B_L$, and $B_Q$, but these can be obtained more easily from
the three means for the three levels of A, and similarly for B. The reader should verify that the coefficients for
the components of the main effects are similar to those given in Table 2. As before, the coefficients for
interactions are the products of corresponding coefficients for the main effects. With Table 5 as an example, the
reader should have no difficulty in extending this to a 3 x 4 or 4 x 5 factorial, or to more than two factors. As an
exercise, the reader should write the coefficients for a 2 x 3 x 8 factorial.
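
The product rule described in the preceding paragraph can be sketched in Python for the 3 x 3 case. The treatment order assumed here is the one used in Table 5 (levels of B varying fastest within each level of A); the helper names `expand_a`, `expand_b`, `product`, and `dot` are illustrative and not part of the original text.

```python
from itertools import combinations

# One-factor orthogonal polynomial coefficients for 3 equally spaced levels.
lin = [-1, 0, 1]
quad = [1, -2, 1]


def expand_a(coeffs):
    """Main-effect coefficients for A: each value repeated over the 3 levels of B."""
    return [ca for ca in coeffs for _ in range(3)]


def expand_b(coeffs):
    """Main-effect coefficients for B: the pattern repeated within each level of A."""
    return [cb for _ in range(3) for cb in coeffs]


def product(a_coeffs, b_coeffs):
    """Interaction coefficients: products of corresponding main-effect coefficients."""
    return [ca * cb for ca in a_coeffs for cb in b_coeffs]


rows = {
    "A_L": expand_a(lin),
    "A_Q": expand_a(quad),
    "B_L": expand_b(lin),
    "B_Q": expand_b(quad),
    "A_L x B_L": product(lin, lin),
    "A_Q x B_L": product(quad, lin),
    "A_L x B_Q": product(lin, quad),
    "A_Q x B_Q": product(quad, quad),
}


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Every pair of rows should be orthogonal, and every row should sum to zero.
all_orthogonal = all(dot(rows[p], rows[q]) == 0 for p, q in combinations(rows, 2))
all_contrasts = all(sum(r) == 0 for r in rows.values())

for name, r in rows.items():
    print(f"{name:10s}", r)
```

Replacing `lin` and `quad` with the coefficients for 4 or 5 levels from Table 2 (and adjusting the repeat counts) would give the corresponding rows for larger factorials.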


Table 5. Orthogonal polynomials for 3 x 3 factorial (equally spaced)

                              Treatments
                            A = 1          A = 2          A = 3
B:                        1   2   3      1   2   3      1   2   3
x_1 or A_L:              -1  -1  -1      0   0   0      1   1   1
x_1^2 or A_Q:             1   1   1     -2  -2  -2      1   1   1
x_2 or B_L:              -1   0   1     -1   0   1     -1   0   1
x_2^2 or B_Q:             1  -2   1      1  -2   1      1  -2   1
x_1 x_2 or A_L x B_L:     1   0  -1      0   0   0     -1   0   1
x_1^2 x_2 or A_Q x B_L:  -1   0   1      2   0  -2     -1   0   1
x_1 x_2^2 or A_L x B_Q:  -1   2  -1      0   0   0      1  -2   1
x_1^2 x_2^2 or A_Q x B_Q: 1  -2   1     -2   4  -2      1  -2   1

As in the one-factor case, if regression (whether linear or quadratic) is significant, then all treatments
are different and no multiple comparison procedure is necessary. Suppose a second order model is necessary 
and sufficient. We can use this model for interpolation; i.e., to predict the response y at any point within the 
range of the values of the two factors used in the experiment. Polynomials are notoriously bad for extrapola- 
tion. We also can find the combination of values of x, and x, that will optimize (maximize or minimize) y. To do 
this, we differentiate y with respect to x, and x2, set these two derivatives to zero, and solve the two resulting 
equations. The solution is: 


$x_1^* = (2 b_1 b_{22} - b_2 b_{12})/(b_{12}^2 - 4 b_{11} b_{22})$
$x_2^* = (2 b_2 b_{11} - b_1 b_{12})/(b_{12}^2 - 4 b_{11} b_{22}).$


These values of $x_1^*$ and $x_2^*$ (if the true values of the b's are known) will optimize y. The estimated optimum
value of y is obtained by putting the estimated values of $x_1^*$ and $x_2^*$ (in terms of the estimated b's) into the
second order model.
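
A minimal sketch of this calculation follows. The coefficient values are invented purely for illustration, and the function names are hypothetical. The check at the end confirms that both partial derivatives vanish at the computed point; whether that point is a maximum, a minimum, or a saddle depends on the signs of the second-order coefficients (a maximum requires $b_{11} < 0$ and $4 b_{11} b_{22} - b_{12}^2 > 0$).

```python
def stationary_point(b1, b2, b11, b12, b22):
    """Stationary point (x1*, x2*) of the second order model in two factors."""
    den = b12 ** 2 - 4 * b11 * b22
    if den == 0:
        raise ValueError("no unique stationary point: b12^2 = 4*b11*b22")
    x1 = (2 * b1 * b22 - b2 * b12) / den
    x2 = (2 * b2 * b11 - b1 * b12) / den
    return x1, x2


def gradient(x1, x2, b1, b2, b11, b12, b22):
    """Partial derivatives of y with respect to x1 and x2."""
    return (b1 + 2 * b11 * x1 + b12 * x2,
            b2 + b12 * x1 + 2 * b22 * x2)


def predict(x1, x2, b0, b1, b2, b11, b12, b22):
    """Value of the second order model at (x1, x2)."""
    return b0 + b1 * x1 + b2 * x2 + b11 * x1 ** 2 + b12 * x1 * x2 + b22 * x2 ** 2


# Invented coefficients, for illustration only.
b0, b1, b2, b11, b12, b22 = 10.0, 4.0, 6.0, -1.0, 0.5, -1.5

x1_opt, x2_opt = stationary_point(b1, b2, b11, b12, b22)
g1, g2 = gradient(x1_opt, x2_opt, b1, b2, b11, b12, b22)
y_opt = predict(x1_opt, x2_opt, b0, b1, b2, b11, b12, b22)
print(x1_opt, x2_opt, y_opt)
```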

If the two factors are two kinds of fertilizers, say, the optimum y may require such a large amount of both
fertilizers that it will not be economically optimum. Instead of fitting a model to the yield y, perhaps we should 
fit a model to z, the yield per dollar of fertilizers applied, and optimize z. 

If the response surface (value of y as x, and x, vary) is highly peaked at the optimum, we should not stray 
far from the optimum combination of x, and x, because y will drop sharply. On the other hand, if the response 
surface is rather flat near the optimum, we can depart from the optimum condition without any appreciable 
decrease in y and the other combinations may be more convenient. One way to study the response surface is to 
draw contours. Suppose the estimated optimum value of y is 138, say. We can set y = 135, 130, 125, etc., in the
second order model. These values will give us the sets of values of $x_1$ and $x_2$ that will give an estimated yield
of 135, 130, etc.

We also can use the equation to estimate the difference in the response at two different points. For
example, for the same value of $x_1$ but different values of $x_2$ ($x_2$ and $x_2^*$, say), the difference in the
responses is

$y(x_1, x_2^*) - y(x_1, x_2) = (x_2^* - x_2)\, b_2 + x_1 (x_2^* - x_2)\, b_{12} + (x_2^{*2} - x_2^2)\, b_{22},$

and its variance is

$(x_2^* - x_2)^2\, V(b_2) + x_1^2 (x_2^* - x_2)^2\, V(b_{12}) + (x_2^{*2} - x_2^2)^2\, V(b_{22}) + 2 x_1 (x_2^* - x_2)^2\, \text{Cov}(b_2, b_{12})$
$\quad + 2 (x_2^* - x_2)(x_2^{*2} - x_2^2)\, \text{Cov}(b_2, b_{22}) + 2 x_1 (x_2^* - x_2)(x_2^{*2} - x_2^2)\, \text{Cov}(b_{12}, b_{22}).$

Similarly, we can estimate $y(x_1^*, x_2) - y(x_1, x_2)$ and $y(x_1^*, x_2^*) - y(x_1, x_2)$, and their standard errors.
Variances and covariances of the regression coefficients will be included in the computer printout from a good
regression analysis program.
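
The six-term variance is the quadratic form c'Vc, where c holds the multipliers of $b_2$, $b_{12}$, and $b_{22}$ in the estimated difference and V is their covariance matrix. The sketch below checks that the two routes agree; the covariance matrix and the design points are invented for illustration, and in practice V would be taken from the regression printout.

```python
def difference_variance(x1, x2_old, x2_new, V):
    """Variance of y(x1, x2_new) - y(x1, x2_old) as the quadratic form c'Vc.

    V is the 3 x 3 covariance matrix of (b2, b12, b22), in that order.
    """
    c = [x2_new - x2_old, x1 * (x2_new - x2_old), x2_new ** 2 - x2_old ** 2]
    return sum(c[i] * V[i][j] * c[j] for i in range(3) for j in range(3))


def difference_variance_termwise(x1, x2_old, x2_new, V):
    """The same variance written out term by term, as in the text."""
    d = x2_new - x2_old
    dq = x2_new ** 2 - x2_old ** 2
    return (d ** 2 * V[0][0] + x1 ** 2 * d ** 2 * V[1][1] + dq ** 2 * V[2][2]
            + 2 * x1 * d ** 2 * V[0][1]
            + 2 * d * dq * V[0][2]
            + 2 * x1 * d * dq * V[1][2])


# Invented covariance matrix of (b2, b12, b22), for illustration only.
V = [[0.040, 0.002, -0.006],
     [0.002, 0.010, -0.001],
     [-0.006, -0.001, 0.008]]

var_quadratic_form = difference_variance(1.5, 2.0, 3.0, V)
var_termwise = difference_variance_termwise(1.5, 2.0, 3.0, V)
print(var_quadratic_form, var_termwise)
```

The quadratic-form version extends directly to differences in which $x_1$ also changes: only the vector c has to be rewritten, with V enlarged to cover all the coefficients involved.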

We conclude by mentioning a question of experimental design. Box and Wilson (1951) pointed out that 
the squared terms in the second order model are estimated with relatively low precision in a 3 x 3 factorial.
Box and his coworkers have developed so-called response surface designs. The texts mentioned previously 
for fractional factorials also contain discussion on response surface methodology. Further references are Box 
and Hunter (1958) and Myers (1971). 


2.4. Mixed Factors 


Consider two factors A and B, with p and q levels respectively, where A is qualitative and B is 
quantitative. An example would be an experiment comparing several varieties of peanuts and several rates of 
a fertilizer, or destruction rates of a certain bacteria at different temperatures, using several culture media. 

Table 6 shows the analysis of variance of a randomized block experiment, showing the partitioning of the 
d.f. for the pq treatments. We have partitioned the d.f. for the main effects of B into linear and quadratic 
regression only, but a higher polynomial also may be fitted. If the levels of B are spaced equally, ss(B_L) and
ss(B_Q) will be easy to get, using orthogonal polynomials, and ss(B_R) will be obtained by difference, using ss(B).
If the levels of A are such that meaningful orthogonal contrasts can be formed among them (before looking at
the data), we should partition its d.f. accordingly, and also the d.f. for A x B_L, etc.

Table 6. Analysis of variance of a randomized block experiment with 1 qualitative (A) and 1 quantitative (B)
factor

Sources of variation      d.f.            s.s.        m.s.
Blocks                    b-1             ss(b)       ms(b)
Treatments (T)            pq-1            ss(t)       -
  A                       p-1             ss(A)       ms(A)
  B                       q-1             ss(B)       -
    B lin.                1               ss(B_L)     ms(B_L)
    B quad.               1               ss(B_Q)     ms(B_Q)
    Rest                  q-3             ss(B_R)     ms(B_R)
  A x B                   (p-1)(q-1)      ss(AB)      -
    A x B lin.            p-1             ss(AB_L)    ms(AB_L)
    A x B quad.           p-1             ss(AB_Q)    ms(AB_Q)
    A x B rest            (p-1)(q-3)      ss(AB_R)    ms(AB_R)
Error (B x T)             (b-1)(pq-1)     ss(e)       ms(e)
Total                     bpq-1           ss(T)


If a computer is not available and the levels of B are equally spaced, one way of getting ss(AB_L) is as
follows. Suppose p = 4, so that A has three d.f. Arbitrarily partition these three d.f. orthogonally. A
convenient set of coefficients is A_1 = (1, -1, 0, 0), A_2 = (1, 1, -2, 0), and A_3 = (1, 1, 1, -3). Multiply these
coefficients by those for B linear, to get A_1 x B lin., A_2 x B lin., and A_3 x B lin., analogously to Table 5. The
sum of the sums of squares for these three interactions is ss(AB_L). We get ss(AB_Q) similarly. Davies (1956)
discusses another method and gives two numerical examples.

Notice that if p = 5 and q = 6, say, A x B rest has 12 d.f. If this can be assumed not to exist, which is quite
reasonable, we can use its m.s. as error and need no replication (i.e., b = 1).

As before, it does not make very much sense to test main effects of A and B if their interaction is 
significant. If the A x B linear interaction is significant, the linear regression on the levels of B (i.e., the
slope of the line) is not the same for all levels of A, as shown in the following diagram. (This is an extreme
case; usually the signs of the slopes are all the same.) Similarly, if A x B quadratic is significant, the curvature
of the regression
curve depends on the level of A. If there is no interaction, the lines (or curves) will be essentially parallel, 
possibly differing only in heights to reflect different effects of the qualitative factor. If it is impossible (a 
priori) to partition the d.f. for A, the levels of A may be compared using a multiple comparison procedure. 


[Figure: response plotted against levels of B, with a separate regression line for each of Variety 1, Variety 2,
and Variety 3.]


CHAPTER 3. MULTIPLE COMPARISON PROCEDURES 


In comparing t (three or more) treatments, the null hypothesis tested is that the t true means are all 
equal ($H_0$: $\mu_1 = \mu_2 = \dots = \mu_t$). The alternative hypothesis, in general, is not that the t means are all unequal
(although this may be true), but merely that they are not all equal. (For example, all but one treatment may 
be equal.) The next step, therefore, is to determine which treatments are different, using a so-called multiple 
comparison procedure. We suppose that it is impossible (a priori) to partition meaningfully the degrees of 
freedom for treatments; otherwise, the problem belongs in Chapter 2 and no multiple comparison procedure 
is needed. 


3.1. Error Rates 


Before describing the multiple comparison procedures that have been proposed, we will discuss the 
question of error rates further. We should not test all possible pairs of means with the ordinary or Student’s 
t-test because it is relatively easy to commit a Type I error (saying two treatment means are unequal when, in 
fact, they are equal). For example, if we carry out a supposedly 5% t-test (i.e., 5% probability of committing a 
Type I error) with, say, 40 degrees of freedom for error mean square to compare all possible pairs, the 
probability is actually 20%, 35%, 48%, 59%, and 68% that the extremes (largest and smallest) of 4, 6, 8,10, and 
12 means, respectively, will be declared significantly different, when the true means are, in fact, all equal 
(David 1962, p. 145). 


When comparing three or more treatments in an experiment, there are at least two kinds of Type I error 
rates, based on the comparison and the experiment as the basic counting units. These rates are defined as: 
Comparisonwise Type I error rate = (Number of comparisons incorrectly declared significant)/(Total 
number of nonsignificant comparisons tested). 
Experimentwise Type I error rate = (Number of experiments with one or more comparisons incorrectly
declared significant)/(Total number of experiments with at least two equal means). 
If each experiment has only two treatments, these rates are identical. 


Suppose Statistician A always performs his statistical tests at the 5% comparisonwise level. Each true or 
nonsignificant comparison will have a 5% probability of being incorrectly declared statistically significant. If, 
in his professional career, he makes N comparisons altogether and in M of these the null hypothesis is true,
then approximately 5% of those M true comparisons will result in rejections. 


In the experimentwise error rate, the experiment is the unit and no distinction is made between 
incorrectly rejecting one comparison and incorrectly rejecting, say, 10 comparisons in the same experiment. 
A Type I error is committed for the whole experiment if a Type I error is committed for one or more of the 
comparisons within that experiment. It does not distinguish between an experiment with true means equal 
to 0, 0, 0, 0, 1, say, and one with true means 0, 0, 1, 2, 3. It is easier to make one or more incorrect rejections in 
the former experiment, when comparing all possible pairs of means. It also does not distinguish between an 
experiment with 2 treatments and one with 20 treatments. Intuitively, we might feel that an incorrectly 
rejected comparison is more serious in an experiment with 2 treatments than in one with 20 treatments. The 
former experiment has only 1 comparison, while the latter has 190 possible comparisons. All other things 
being equal, it is obviously much easier to make one incorrect rejection in a large experiment with many 
treatments than in a small experiment with few treatments. Thus, a 5% experimentwise error rate is much 
more stringent than a 5% comparisonwise error rate. The relative frequency interpretation is as follows. If 
Statistician B always uses a 5% experimentwise error rate and throughout his career he analyzes N 
experiments and in M of these at least two treatment means are equal (so that an incorrect rejection is 
possible), then in approximately 5% of the M experiments, one or more comparisons will be rejected 
incorrectly. 

For orthogonal comparisons and infinite d.f. for the error m.s. (so that the tests will be statistically 
independent), the experimentwise error rate (E, say) is related to the comparisonwise error rate α and the
number of treatments t as follows:

$E = 1 - (1 - \alpha)^{t-1}; \qquad \alpha = 1 - (1 - E)^{1/(t-1)}.$     (3.1)

If α = .05, this equation gives E = .05, .0975, .1426, .1855, .2263, .2649, .3017, .3366, .3698, .5124, and .6227
for t = 2, 3, ..., 9, 10, 15, and 20, respectively. Thus, if we test each of the 9 orthogonal comparisons at the
5% level, in an experiment with t = 10 treatments (and the null hypothesis $H_0$ is true), the probability of
rejecting (incorrectly) one or more comparisons is 36.98%. The overall protection against incorrectly rejecting
any of the nine comparisons is 63.02% in this example.

If E = .05, the preceding equation gives α = .05, .0253, .0169, .0127, .0057, .0037, and .0028 for t = 2, 3, 4,
5, 10, 15, and 20, respectively. Thus, if we wish to hold the experimentwise error rate to 5% (i.e., 5%
probability of rejecting one or more orthogonal comparisons in an experiment where all treatments are equal
or, equivalently, 95% protection against incorrectly rejecting any comparison), we have to make each
comparison at α = .0057 (i.e., the 0.57% level) if there are 10 treatments in the experiment.
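
Equation (3.1) is simple to evaluate directly. The sketch below (function names are illustrative) assumes, as the text does, t - 1 orthogonal comparisons tested independently; values computed this way may differ from the tabulated ones in the last decimal place because of rounding.

```python
def experimentwise_rate(alpha, t):
    """E from Equation (3.1): chance of at least one false rejection among t - 1 comparisons."""
    return 1 - (1 - alpha) ** (t - 1)


def comparisonwise_rate(E, t):
    """alpha from Equation (3.1): per-comparison level that holds the experimentwise rate at E."""
    return 1 - (1 - E) ** (1 / (t - 1))


for t in (2, 3, 4, 5, 10, 15, 20):
    print(t,
          round(experimentwise_rate(0.05, t), 4),
          round(comparisonwise_rate(0.05, t), 4))
```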

There is no rigid rule or criterion that enables us to decide whether a comparisonwise or an experi- 
mentwise error rate is more appropriate. It is mostly a subjective choice. An experimentwise rate is more 
conservative in that fewer Type I errors (false significances) will be made; however, more Type II errors 
(failure to detect true differences) will be made. A similar problem exists in choosing the significance level a 
in the simple two-treatment case. Should a be taken to be .05 or .01? In situations where incorrectly rejecting 
one comparison may vitiate the entire experiment or incorrectly rejecting one comparison is as serious as 
incorrectly rejecting 10 comparisons, an experimentwise error rate is more pertinent. A comparisonwise 
error rate should be used if one faulty inference does not affect the remaining inferences from the same 
experiment. The author favors comparisonwise error rates in general. For further discussion of error rates, 
see Tukey (1953b), Harter (1957), and Federer (1961). 

We shall now describe the multiple comparison procedures in turn. Some textbooks that contain a 
discussion of this topic are Federer (1955), Steel and Torrie (1960), Scheffé (1959), Seeger (1966), Kirk (1968), 
Bancroft (1968), and Miller (1966). Some review papers on this topic are Hartley (1955), Cornell (1971), Gill 
(1973), Games (1971), Ryan (1959), O’Neill and Wetherill (1971), Thomas (1973), Waldo (1976), etc. The 
O’Neill and Wetherill paper has a bibliography of 234 references, classified into 15 categories (multiple range 
tests, error rates, simultaneous confidence intervals, etc.). Thomas has an unpublished bibliography on 
multiple comparison techniques (available from him) containing about 300 references up to 1970. 


3.2. Fisher's Protected and Unprotected LSD Methods

Fisher's protected LSD (least significant difference) procedure is to be applied only if the overall F test
for treatments is significant. It consists of applying the ordinary Student's t test to any pair of means
$\bar{y}_i$ and $\bar{y}_j$. Let $s^2$ be the error mean square (with ν degrees of freedom) from the analysis of variance
table, and $n_i$ and $n_j$ be the number of replications of treatments i and j, respectively. The two treatments will
be declared different if the two observed means $\bar{y}_i$ and $\bar{y}_j$ differ (in absolute magnitude) by more than the
LSD given by

$\text{LSD} = t(\alpha,\nu)\sqrt{s^2[(1/n_i) + (1/n_j)]},$     (3.2)

where $t(\alpha,\nu)$ is the tabulated two-sided (100α)% value of the t-distribution with ν degrees of freedom; e.g.,
t(.05, 30) = 2.04.

Besides permitting unequally replicated treatments, the procedure is applicable for interval estimation.
Thus, the 100(1 - α)% confidence interval for $(\mu_i - \mu_j)$ is $(\bar{y}_i - \bar{y}_j) \pm \text{LSD}$. (Note that if the difference
between $\bar{y}_i$ and $\bar{y}_j$ is less than the LSD, the confidence limits will have different signs so that the
hypothesis of equal means is accepted. Recall the connection between hypothesis testing and interval estimation
mentioned in chapter 1.) A third desirable feature is its ease of application, especially if all treatments are
replicated equally. The LSD for all pairs of treatments is $t(\alpha,\nu)\sqrt{2s^2/n}$, where n is the common number of
replications. (It is possible for the overall F test to be significant but none of the t tests for the pairwise
differences to be significant. See Miller (1966, page 91).)

To illustrate the method we will use the data in Duncan (1955) from a randomized block experiment with 
six blocks and seven treatments (varieties of barley). The analysis of variance gave a treatments mean square 
of 366.97 (with 6 d.f.), an error mean square ($s^2$) of 79.64 (with ν = 30 d.f.), with a highly significant F ratio of
4.61. The means (in bushels per acre) of the seven varieties, given below, have been relabeled A through G in
increasing order.

49.6    58.1    61.0    61.5    67.6    71.2    71.3
 A       B       C       D       E       F       G


With ν = 30 and taking α to be 0.05, t(α,ν) = 2.04 and the LSD = $2.04 \times \sqrt{2(79.64)/6} = 10.51$. Any two means
differing by more than 10.51 will be significantly different at the 5% level. We systematically test G - A,
G - B, G - C, G - D, G - E, G - F; F - A, F - B, ..., F - E; E - A, ..., E - D; ...; B - A. In practice, of
course, we may not need to test all possible pairs. For example, once we have found G - C = 10.3 to be less
than the LSD, we need not test G - D, G - E, and G - F, for these cannot be significant. The results usually
are presented by underscoring (means underscored by the same line are not significantly different) or by
using superscripts (means having the same superscript are not significantly different). For the preceding
example, the results are as follows:


49.6^a  58.1^ab  61.0^bc  61.5^bc  67.6^bc  71.2^c  71.3^c
  A       B        C        D        E        F       G


Another way of presenting the results, which is typographically convenient, is to group the means as follows: 
(A,B), (B,C,D,E), and (C,D,E,F,G). Means in the same parentheses are not different. There were seven 
significant differences (GA, GB, FA, FB, EA, DA, and CA). An unpleasant feature of many multiple comparison
procedures is the lack of “transitivity.” In the preceding example, A and B were judged to be the same, as were
B and C, but A and C were different.
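
The pairwise testing above can be reproduced with a short script. This is an illustrative sketch: the helper `significant_pairs` is our own name, and the critical value 2.04 is taken from the t table rather than computed.

```python
import math
from itertools import combinations

# Variety means (bushels per acre) from the example above.
means = {"A": 49.6, "B": 58.1, "C": 61.0, "D": 61.5,
         "E": 67.6, "F": 71.2, "G": 71.3}
s2, n, t_crit = 79.64, 6, 2.04   # error mean square, replications, t(.05, 30)

lsd = t_crit * math.sqrt(2 * s2 / n)

def significant_pairs(means, yardstick):
    """Return labels 'XY' (larger mean first) for pairs differing by more than yardstick."""
    ordered = sorted(means, key=means.get)        # labels in ascending order of mean
    pairs = []
    for low, high in combinations(ordered, 2):    # high always has the larger mean
        if means[high] - means[low] > yardstick:
            pairs.append(high + low)
    return pairs

pairs = significant_pairs(means, lsd)
print(round(lsd, 2), pairs)
```

The seven pairs found are the ones listed in the text; B is in none of the pairs involving C, D, or E, which is the lack of transitivity noted above.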

This procedure is satisfactory if $H_0$ is true. However, suppose $H_0$ is false such that all means but one are
equal, and this single mean is much larger (or much smaller) than the other (t - 1) means. The overall F-test
will be significant, and repeated t-tests applied to the (t - 1) equal means will have a large probability of
declaring some of these (t - 1) means to be unequal. This objection is removed in the Newman-Keuls
procedure, to be discussed in Section 3.3.

In the unprotected LSD method, a preliminary F test need not be carried out at all, but the error rate for
each individual comparison is reduced to $\alpha/m$, where m is the total number of comparisons (preferably
specified in advance) that we wish to make among the t treatments. If we restrict ourselves to orthogonal
contrasts, m = (t - 1); if we make all possible pairwise comparisons, m = t(t - 1)/2. More generally, we can
budget m different error rates $\alpha_1, \alpha_2, \ldots, \alpha_m$ for the m contrasts, where these add up to
$\alpha$. If it is more serious to incorrectly reject the i-th contrast than the j-th contrast, we would choose
$\alpha_i < \alpha_j$. It can be shown (using the so-called Bonferroni inequality) that the experimentwise error
rate E is at most $\alpha$. Percentage points of the t-distribution for carrying out Fisher’s unprotected LSD
procedure may be found in Table A in the appendix, reproduced from Dunn (1961). Alternatively, Scheffé (1959,
page 80) gives the following approximation (due to A.M. Peiser) for the upper (one-sided) $\alpha$ point of the
t distribution with $\nu$ d.f.:


$$t_{\alpha;\nu} \approx z_\alpha + \frac{z_\alpha^3 + z_\alpha}{4\nu},$$


where $z_\alpha$ denotes the upper $\alpha$ point of the standard normal distribution; e.g., $z_{.05}$ = 1.645.
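
As a rough check on the Bonferroni allocation and the approximation, the sketch below assumes that Peiser's formula has the form z + (z^3 + z)/(4 nu), the leading terms of the usual expansion of a t quantile about the normal quantile. The function names are ours, and the approximation loses accuracy far out in the tails, so Table A should be preferred when it covers the case.

```python
from statistics import NormalDist

def peiser_t(alpha_one_sided: float, nu: int) -> float:
    """Approximate upper alpha point of t with nu d.f. (assumed form: z + (z**3 + z) / (4 nu))."""
    z = NormalDist().inv_cdf(1.0 - alpha_one_sided)
    return z + (z ** 3 + z) / (4.0 * nu)

def bonferroni_t(alpha: float, m: int, nu: int) -> float:
    """Approximate two-sided critical t when each of m comparisons is tested at level alpha / m."""
    return peiser_t(alpha / (2.0 * m), nu)

t_treatments = 7
m = t_treatments * (t_treatments - 1) // 2     # number of pairwise comparisons
print(m)
print(round(peiser_t(0.025, 30), 3))           # compare with the tabulated t(.05, 30) = 2.04
print(round(bonferroni_t(0.05, m, 30), 2))     # larger critical value for the unprotected LSD
```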


3.3. Newman-Keuls’ Multiple Range Test 


This method is applicable only in situations where all t treatments are equally replicated n times. As in
Section 3.2, $s^2$ is the error mean square with $\nu$ degrees of freedom. This method does not have a prior
significant F test as a prerequisite. To apply the method, we arrange the means in ascending order, but
instead of comparing the difference between any two means with a constant least significant difference (as in
Section 3.2), we test it against a variable yardstick


$$W_p = q(\alpha; p, \nu)\,\sqrt{s^2/n}, \qquad (3.3)$$


where p (= 2, 3, ..., t) is the number of means whose range (i.e., largest minus smallest) we are testing, and
$q(\alpha; p, \nu)$ is the $100\alpha\%$ point of $q(p, \nu)$, the distribution of the studentized range of p means
and $\nu$ degrees of freedom. Values of $q(\alpha; p, \nu)$ are tabulated in Pearson and Hartley (1966) and Harter
(1960a). They are reproduced in condensed form in the Appendix (Table B), Beyer (1968), Miller (1966), Steel and
Torrie (1960), etc.

For the numerical example in Section 3.2, t = 7, $\nu$ = 30, and $\sqrt{s^2/n} = \sqrt{79.64/6}$ = 3.643. For
$\alpha$ = .05, the values of q are:


p:                   2      3      4      5      6      7
q(.05; p, 30):      2.89   3.49   3.85   4.10   4.30   4.46
W_p = 3.643 q:     10.53  12.71  14.03  14.94  15.66  16.25


Fisher’s LSD and $W_2$ are identical (apart from rounding: 10.51 against 10.53). We test G - A against $W_7$ = 16.25
since G - A is the range of 7 means. There are 2 ranges of 6 means (viz., G - B and F - A), and these are compared
with $W_6$ = 15.66. Similarly, we test the three five-mean ranges G - C, F - B, E - A against $W_5$ = 14.94; G - D,
F - C, E - B, D - A against $W_4$ = 14.03; G - E, F - D, E - C, D - B, C - A against $W_3$ = 12.71; and G - F, F - E,
E - D, D - C, C - B, B - A against $W_2$ = 10.53. In practice, we need to perform far fewer tests than these, for once
two means are judged to be not different, they are underscored by a line, and no further testing is made among
means that are between the two means so underscored. We need only test G - A = 21.7 > $W_7$, G - B = 13.2 < $W_6$
(underscore), F - A = 21.6 > $W_6$, E - A = 18.0 > $W_5$, and D - A = 11.9 < $W_4$ (underscore). No further testing is
necessary. The results are as follows:


A^a  B^ab  C^ab  D^ab  E^b  F^b  G^b     or, (A,B,C,D) and (B,C,D,E,F,G).


This method gives only 3 significant pairs (G - A, F - A, and E - A), compared to 7 pairs from the LSD
method. The Newman-Keuls procedure is intuitively more appealing than the LSD method. One feels that
the difference between the extremes of 7 means should pass a more stringent test than the difference between
the extremes of, say, 3 means. The method has the disadvantage of not being amenable to interval estimation.
The error rate is confusing because it is neither experimentwise nor comparisonwise. At each stage of testing
(range of t means, (t - 1) means, etc.), the probability of rejecting the hypothesis of equal means, if true, is
$\alpha$.
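
The testing sequence of this section (widest range first, and no testing inside a range already judged homogeneous) can be sketched as follows. This is an illustration under our own naming (`step_down_ranges`), with the q values copied from the worked example; it is not a substitute for the tabulated procedure.

```python
import math

def step_down_ranges(means, q_by_p, s2, n):
    """Multiple range test whose yardstick depends on p, the number of means spanned.

    means  : dict mapping label -> treatment mean (equal replication n assumed)
    q_by_p : dict mapping p -> tabulated studentized-range point for a range of p means
    Returns (significant pairs, homogeneous groups), each as (lowest label, highest label).
    """
    labels = sorted(means, key=means.get)
    se = math.sqrt(s2 / n)                       # standard error of a treatment mean
    t = len(labels)
    groups, significant = [], []
    for p in range(t, 1, -1):                    # widest range first
        for lo in range(t - p + 1):
            hi = lo + p - 1
            # Skip any range lying inside one already declared homogeneous.
            if any(a <= lo and hi <= b for a, b in groups):
                continue
            if means[labels[hi]] - means[labels[lo]] > q_by_p[p] * se:
                significant.append((labels[lo], labels[hi]))
            else:
                groups.append((lo, hi))
    return significant, [(labels[a], labels[b]) for a, b in groups]

means = {"A": 49.6, "B": 58.1, "C": 61.0, "D": 61.5,
         "E": 67.6, "F": 71.2, "G": 71.3}
q05 = {2: 2.89, 3: 3.49, 4: 3.85, 5: 4.10, 6: 4.30, 7: 4.46}   # q(.05; p, 30)
sig, groups = step_down_ranges(means, q05, 79.64, 6)
print(sig)      # pairs declared different
print(groups)   # ranges underscored as homogeneous
```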


3.4. Tukey’s HSD Method and Multiple Range Test 


Tukey’s original HSD (honestly significant difference) procedure (1951, 1953) requires equal replications.
It has the simplicity of Fisher’s LSD method in having a constant yardstick with which to test all pairs
of treatment means. The HSD is calculated as the $W_p$ of the Newman-Keuls procedure, with p taken at its
maximum value (i.e., with p = t, the total number of treatments). Thus, two treatments are declared to be
different (in their effects) if the absolute magnitude of the difference between their means exceeds


$$\mathrm{HSD} = W_t = q(\alpha; t, \nu)\,\sqrt{s^2/n}, \qquad (3.4a)$$


where the symbols are as in Equation (3.3).

In the previous example, with t = 7 treatments, error mean square $s^2$ = 79.64 with $\nu$ = 30 d.f., and n = 6
replications, the HSD = $q(\alpha; 7, 30) \times 3.643 = 4.46 \times 3.643$ = 16.25, if $\alpha$ = .05. Testing the
difference between every pair of means against 16.25, we get results that are identical to those given by the
Newman-Keuls procedure. In general, we shall get fewer significant differences from Tukey’s method. Since the
error rate of Tukey’s HSD method is experimentwise, Hartley (1955) recommends that $\alpha$ be taken as 0.10 or higher.

Tukey’s HSD procedure also can be used to construct simultaneous confidence intervals for all pairs of
treatment differences as follows:


$$\Pr\{(\mu_i - \mu_j) \text{ lies within } (\bar{y}_i - \bar{y}_j) \pm \mathrm{HSD};\ i, j = 1, 2, \ldots, t\} = 1 - \alpha. \qquad (3.4b)$$


In words, Equation (3.4b) states that the probability is 0.95 that all of the following statements are true:
$\mu_G - \mu_A$ = (71.3 - 49.6) ± 16.25; $\mu_G - \mu_B$ = (71.3 - 58.1) ± 16.25; ...;
$\mu_G - \mu_F$ = (71.3 - 71.2) ± 16.25; $\mu_F - \mu_A$ = (71.2 - 49.6) ± 16.25; ...;
$\mu_F - \mu_E$ = (71.2 - 67.6) ± 16.25; ...; $\mu_B - \mu_A$ = (58.1 - 49.6) ± 16.25.
Equation (3.4b) can be generalized to simultaneous confidence intervals for linear contrasts among the t
treatment population means, as shown in Equation (3.4c).


$$\Pr\Big\{\sum_{i=1}^{t} c_i\mu_i \text{ lies within } \sum_{i=1}^{t} c_i\bar{y}_i \pm \tfrac{1}{2}(\mathrm{HSD})\sum_{i=1}^{t} |c_i|\Big\} = 1 - \alpha, \qquad (3.4c)$$


for all sets of coefficients $(c_1, c_2, \ldots, c_t)$ satisfying $\sum c_i = 0$. (There is an uncountable infinity of
such sets.) Equation (3.4c) immediately reduces to (3.4b) if the contrast is a pairwise difference, for then one
coefficient is +1, another is -1, and the rest are zero, so that $\tfrac{1}{2}\sum |c_i| = 1$. Equation (3.4c) also
enables us to test a more general hypothesis $H_0$: $\sum c_i\mu_i$ = d (specified). We reject $H_0$ if the confidence
limits for the contrast exclude d. Gabriel (1964) shows that at least one contrast will be significant if, and only
if, the overall F test is significant. This is not true if the contrasts are restricted to paired differences only.
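
A sketch of the interval computation for a general contrast follows, assuming the half-width is the HSD multiplied by half the sum of the absolute coefficients. The function name and the second contrast are illustrative choices of ours; q(.05; 7, 30) = 4.46 is taken from the worked example.

```python
import math

def tukey_contrast_interval(coeffs, means, q_crit, s2, n):
    """Simultaneous interval for sum(c_i * mu_i), where the c_i sum to zero.

    Half-width: (1/2) * sum(|c_i|) * HSD, with HSD = q_crit * sqrt(s2 / n).
    """
    if abs(sum(coeffs)) > 1e-9:
        raise ValueError("contrast coefficients must sum to zero")
    hsd = q_crit * math.sqrt(s2 / n)
    estimate = sum(c * m for c, m in zip(coeffs, means))
    half_width = 0.5 * sum(abs(c) for c in coeffs) * hsd
    return estimate - half_width, estimate + half_width

means = [49.6, 58.1, 61.0, 61.5, 67.6, 71.2, 71.3]    # varieties A through G
q_crit, s2, n = 4.46, 79.64, 6                        # q(.05; 7, 30), error mean square, replications

# Pairwise difference G - A: the half-width is the HSD itself.
print(tukey_contrast_interval([-1, 0, 0, 0, 0, 0, 1], means, q_crit, s2, n))
# Average of F and G against the average of A and B (an illustrative contrast).
print(tukey_contrast_interval([-0.5, -0.5, 0, 0, 0, 0.5, 0.5], means, q_crit, s2, n))
```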

To overcome the conservativeness of his HSD procedure, Tukey also has proposed a multiple range test, 
using the average of his HSD and the Newman-Keuls statistic as the test criterion. Thus, the range of p 
ranked means is tested against 


$$\tfrac{1}{2}\,[q(\alpha; p, \nu) + q(\alpha; t, \nu)]\,\sqrt{s^2/n}. \qquad (3.4d)$$


Spjøtvoll and Stoline (1973) and Hochberg (1975, 1976) have extended Tukey’s HSD procedure to allow
unequal variances or unequal sample sizes. If sample sizes are unequal, two approximate procedures are to
use the harmonic mean of the sample sizes (reciprocal of the arithmetic mean of the reciprocals of the sample
sizes) or to replace the estimated variance of a mean ($s^2/n$) in Equation (3.4a) by the average of the variances
of the two means concerned, viz., $s^2[(1/n_i) + (1/n_j)]/2$, as in Kramer’s (1956) modification of Duncan’s multiple
range test. Keselman, Toothaker, and Shooter (1975) found that these two methods “have the same
sensitivity for detecting real mean differences.”


3.5. Scheffé’s Method


Like Tukey’s HSD, Scheffé’s (1953) procedure is applicable to general contrasts, and not just paired
comparisons. Since it employs an experimentwise error rate, Scheffé (1959, page 71) suggests taking $\alpha$ = .10.
Scheffé’s procedure is more general than Tukey’s in being able to handle unequal replications. Let $n_i$ be the
number of replications of the i-th treatment. The contrast $C = \sum_{i=1}^{t} c_i\mu_i$ will be estimated by
$\hat{C} = \sum_{i=1}^{t} c_i\bar{y}_i$, with variance estimated by


$$\hat{V}(\hat{C}) = s^2 \sum_{i=1}^{t} (c_i^2/n_i), \qquad (3.5a)$$


where $s^2$ is the error mean square (from the analysis of variance table) with $\nu$ degrees of freedom, say. The
$100(1-\alpha)\%$ simultaneous confidence intervals for all contrasts C (uncountable infinity of them, obtainable by
varying the set of coefficients $c_1, c_2, \ldots, c_t$) are


$$\hat{C} \pm \sqrt{(t-1)\,F(\alpha; t-1, \nu)\,\hat{V}(\hat{C})}, \qquad (3.5b)$$


where $F(\alpha; t-1, \nu)$ is the upper $100\alpha\%$ point of the F-distribution with (t - 1) and $\nu$ degrees of
freedom (for numerator and denominator, respectively). As an example, F(.05; 6, 30) = 2.42. For pairwise
differences ($\sum c_i^2 = 2$) and equal replications ($n_i = n$), Equation (3.5a) reduces to


$$\hat{V}(\bar{y}_i - \bar{y}_j) = 2s^2/n. \qquad (3.5c)$$


From Equation (3.5b), the $100(1-\alpha)\%$ simultaneous confidence interval for all paired differences
$(\mu_i - \mu_j)$ (for all i and j) is


$$(\bar{y}_i - \bar{y}_j) \pm \sqrt{(t-1)\,F(\alpha; t-1, \nu)\,(2s^2/n)}. \qquad (3.5d)$$


Equation (3.5d) can be used to test the significance of the difference between two means $\mu_i$ and $\mu_j$. We
declare these to be different if the sample means $\bar{y}_i$ and $\bar{y}_j$ differ in absolute magnitude by an
amount exceeding


$$S = \sqrt{(t-1)\,F(\alpha; t-1, \nu)\,(2s^2/n)}. \qquad (3.5e)$$


For t = 2 treatments, S above is identical with the LSD since $\sqrt{F(\alpha; 1, \nu)} = t(\alpha, \nu)$. Using the
relationship between hypothesis testing and interval estimation, we can test the general null hypothesis
$H_0$: $\sum c_i\mu_i$ = d (specified) by seeing whether d falls inside or outside the interval given in Equation (3.5b).

For the previous numerical example, taking $\alpha$ = .05, we have $S = \sqrt{6 \times 2.42 \times 2(79.64)/6}$ = 19.63. Two
treatment sample means will be declared significantly different at the 5% level if their difference exceeds
19.63 in magnitude. (Note that this least significant difference is even larger than Tukey’s HSD = 16.25. This
is a general result. Tukey’s procedure is preferred over Scheffé’s for pairwise comparisons, but for general
contrasts Scheffé’s method gives a shorter interval.) Application of Scheffé’s procedure to the previous
numerical example gives the following results: (A,B,C,D,E) and (B,C,D,E,F,G). There are only two
significant differences (G - A and F - A), compared to three differences from the Newman-Keuls and the
Tukey procedures.
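
The interval of Equation (3.5b) can be sketched in a few lines. The function name `scheffe_interval` is our own, and the F value is taken from a table rather than computed; replications are allowed to differ, as the method permits.

```python
import math

def scheffe_interval(coeffs, means, reps, s2, f_crit):
    """Scheffe interval for the contrast sum(c_i * mu_i).

    reps   : replications n_i (may be unequal)
    f_crit : tabulated F(alpha; t-1, nu)
    """
    t = len(means)
    estimate = sum(c * m for c, m in zip(coeffs, means))
    variance = s2 * sum(c * c / r for c, r in zip(coeffs, reps))   # Equation (3.5a)
    half_width = math.sqrt((t - 1) * f_crit * variance)            # Equation (3.5b)
    return estimate - half_width, estimate + half_width

means = [49.6, 58.1, 61.0, 61.5, 67.6, 71.2, 71.3]    # varieties A through G
reps = [6] * 7
lo, hi = scheffe_interval([-1, 0, 0, 0, 0, 0, 1], means, reps, 79.64, 2.42)
print(round((hi - lo) / 2, 2))    # half-width for a pairwise difference, i.e. S
print(lo > 0)                     # the interval for G - A excludes zero
```

Replacing the coefficients by any other set summing to zero gives the interval for that contrast, with the same simultaneous coverage.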

Equations (3.5a) and (3.5b) are directly applicable to situations where the sample means have unequal
variances because of unequal replications, assuming that single observations are uncorrelated and have equal
variances. For situations where the unequal variances of the sample means also may be caused by observations
from the different treatments having unequal variances, Brown and Forsythe (1974) replace Equation
(3.5a) by $\sum (c_i^2 s_i^2/n_i)$, where $s_i^2$ is the sample variance of the i-th treatment, and $F(\alpha; t-1, \nu)$ in
Equation (3.5b) is replaced by $F(\alpha; t-1, f)$, where f is obtained using Satterthwaite’s result on the d.f. of a
linear combination of sample variances, as follows:


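$$\frac{1}{f} = \sum_{i=1}^{t} \frac{w_i^2}{n_i - 1}, \qquad \text{where}$$


$$w_i = (c_i^2 s_i^2/n_i) \Big/ \sum_{j=1}^{t} (c_j^2 s_j^2/n_j).$$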


For another approximation, see Spjøtvoll (1972).

If the sample means are correlated, Equation (3.5b) will still hold but Equation (3.5a) must be modified to 
include the covariances of the sample means, as in Equation (3.5f). 

Scheffé’s method can be directly generalized to linear model situations, expressible in matrix notation as
$y = X\beta + e$. This covers both multiple regression and analysis of variance models higher than just the one-way
classification. The contrast $C = \sum c_i\beta_i$ will be estimated by $\hat{C} = \sum c_i b_i$, where the $b_i$’s are
the least squares estimates of the $\beta_i$’s. The estimated variance of $\hat{C}$ is


$$\hat{V}(\hat{C}) = \sum_i \sum_j c_i c_j\,(\text{estimated covariance of } b_i, b_j). \qquad (3.5f)$$


Most regression computer programs (e.g., the SAS package put out by North Carolina State University)
include the estimated covariances of the estimated regression coefficients as part of the output. Equations
(3.5b) and (3.5f) may now be used to construct simultaneous confidence intervals for linear contrasts or to
make multiple comparisons among the $\beta$’s.


3.6. Duncan’s Methods 


Of the several procedures that D.B. Duncan proposed between 1941 and 1975, we shall discuss only 
two—his most popular (multiple range test) and his most recent (Bayesian k-ratio LSD rule), which he hopes 
will supplant the former. 


3.6.1. Multiple Range Test 


This method assumes homoscedastic (equal variances) and uncorrelated means. It is very similar to the
Newman-Keuls procedure, except that the protection level at each testing stage varies with p, the number of
means whose range is being tested for significance. Duncan’s rationale for decreasing the protection level as p
increases is as follows. In experiments (factorial or otherwise) where the (p - 1) degrees of freedom for the p
treatments are partitioned into single degrees of freedom to correspond to (p - 1) mutually orthogonal
contrasts, the experimenter has no qualms about testing each contrast at the $\alpha$ level. Assuming for simplicity
that the number of degrees of freedom for the error mean square is infinite (or quite large), the (p - 1) F-ratios
are statistically independent (almost). Therefore, the probability of rejecting one or more contrasts, if all p
means are equal, is


$$\alpha_p = 1 - (1 - \alpha)^{p-1}. \qquad (3.6a)$$


Duncan (1955) modifies the Newman-Keuls multiple range test by using a variable level $\alpha_p$ as the significance
level when testing the range of p means. As an illustration, with p = 9 equal means and $\alpha$ = .05, the
probability of incorrectly rejecting one or more of 8 orthogonal contrasts is 1 - (.95)^8 = 1 - .6634 = .3366. This
large probability of Type I error makes Duncan’s multiple range test very powerful (large probability of 
detecting differences when they exist). Experimenters are often more interested in finding than in not 
finding significant differences among the treatments being tested. For this reason, Duncan’s procedure 
received widespread acceptance among research workers, particularly in the agricultural sciences. As 
originally proposed, no preliminary significant overall F test is required. To overcome, somewhat, the 
objection of a possibly large Type I error probability, we may conservatively require a significant overall F 
test as a necessary condition for the application of the multiple range test. 

In the Newman-Keuls procedure, the yardstick for testing the significance of the range of p means is
$W_p = q(\alpha; p, \nu)\sqrt{s^2/n}$. In Duncan’s procedure, the yardstick is similar, except that $\alpha$ is replaced
by $\alpha_p$, defined by Equation (3.6a), giving the following “shortest significant range” criterion:


$$R_p = q(\alpha_p; p, \nu)\,\sqrt{s^2/n}. \qquad (3.6b)$$


Thus, no special tables are required if we have extensive tables of $q(p, \nu)$, the distribution of the studentized
range of p means and $\nu$ d.f. However, the percentiles $\alpha_p$ are “awkward,” being equal, for example, to .05,
.0975, .1426, .1855, .2262, and .2649 if $\alpha$ = .05 and p = 2, 3, 4, 5, 6, and 7, respectively. For this reason,
Duncan (1955) tabulates $q(\alpha_p; p, \nu)$ for $\alpha$ = .05 and .01; p = 2(1)10(2)20, 50, 100; and
$\nu$ = 1(1)20(2)30, 40, 60, 100, and $\infty$. More accurate and more extensive tables are given in Harter (1960),
reproduced in Harter (1970). A condensed table of $q(\alpha_p; p, \nu)$ is given in the appendix as Table C, in Steel
and Torrie (1960), etc.
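
The “awkward” levels quoted above follow directly from Equation (3.6a), taken here as 1 - (1 - alpha)^(p - 1). A brief sketch, with a function name of our own choosing:

```python
def duncan_level(alpha: float, p: int) -> float:
    """Significance level for a range of p means in Duncan's test: 1 - (1 - alpha)**(p - 1)."""
    return 1.0 - (1.0 - alpha) ** (p - 1)

for p in range(2, 8):
    print(p, round(duncan_level(0.05, p), 4))

# The illustration with p = 9 equal means and alpha = .05:
print(round(duncan_level(0.05, 9), 4))
```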

To apply the method, we arrange the means in ascending order and test each pair against $R_p$, starting
with the extremes. Once two means are declared to be not significantly different, we underline them and no
further testing is made between means underscored by this line. Applied to the previous example with t = 7
means, $\nu$ = 30 d.f., $s^2$ = 79.64, and each treatment equally replicated n = 6 times so that $\sqrt{s^2/n}$ = 3.643, we
have:


p:                      2      3      4      5      6      7
q($\alpha_p$; p, 30):  2.89   3.04   3.12   3.20   3.25   3.29
R_p = 3.643 q:        10.53  11.07  11.37  11.66  11.84  11.99

The results of the test are:

  A     B     C     D     E     F     G
49.6  58.1  61.0  61.5  67.6  71.2  71.3
----------
      ----------------------
            ----------------------------


In these results, G - A = 21.7 > $R_7$, the shortest significant range for 7 means; G - B = 13.2 > $R_6$;
G - C = 10.3 < $R_5$, so we underline C through G and make no comparisons among C, D, E, F, and G. F - A = 21.6 >
$R_6$; F - B = 13.1 > $R_5$, and we need not test F - C, etc.; E - A = 18.0 > $R_5$; E - B = 9.5 < $R_4$, so underline B
through E; D - A = 11.9 > $R_4$; C - A = 11.4 > $R_3$; and finally B - A = 8.5 < $R_2$, so underline A and B. Thus,
the method gives seven significant differences (GA, GB, FA, FB, EA, DA, CA), compared to three
significant differences from the Newman-Keuls test.
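
The same step-down logic used for the Newman-Keuls test applies here, with Duncan's ranges in place of q(alpha; p, nu). The sketch below is illustrative: the function name is ours, and the entry for p = 4 is assumed to be 3.12, which is consistent with its neighbours but should be checked against Table C.

```python
import math

def step_down_ranges(means, q_by_p, s2, n):
    """Test ranges from widest to narrowest; never test inside a homogeneous range."""
    labels = sorted(means, key=means.get)
    se = math.sqrt(s2 / n)                       # standard error of a treatment mean
    t = len(labels)
    groups, significant = [], []
    for p in range(t, 1, -1):
        for lo in range(t - p + 1):
            hi = lo + p - 1
            if any(a <= lo and hi <= b for a, b in groups):
                continue                         # already underscored by a wider line
            if means[labels[hi]] - means[labels[lo]] > q_by_p[p] * se:
                significant.append(labels[hi] + labels[lo])
            else:
                groups.append((lo, hi))
    return significant, [labels[a] + "-" + labels[b] for a, b in groups]

means = {"A": 49.6, "B": 58.1, "C": 61.0, "D": 61.5,
         "E": 67.6, "F": 71.2, "G": 71.3}
# Duncan's q(alpha_p; p, 30) for alpha = .05; the p = 4 entry is an assumed value.
q_duncan = {2: 2.89, 3: 3.04, 4: 3.12, 5: 3.20, 6: 3.25, 7: 3.29}
sig, underscored = step_down_ranges(means, q_duncan, 79.64, 6)
print(sig)
print(underscored)
```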

One disadvantage of this procedure is that it is not amenable to simultaneous interval estimation. If we
use $(\bar{y}_i - \bar{y}_j) \pm R_p$ as the confidence interval for $(\mu_i - \mu_j)$, some pairs of means will have
confidence intervals of different widths, even though all treatments are equally replicated.

In a sense, Fisher’s LSD, Newman-Keuls’ MRT, and Tukey’s HSD are particular cases of Duncan’s
MRT. If in Equation (3.6b), we put $\alpha_p = \alpha$ and p = 2, we obtain Fisher’s LSD. Tukey’s HSD is obtained by
putting $\alpha_p = \alpha$ and p = t; and substitution of $\alpha$ for $\alpha_p$ gives the Newman-Keuls’ MRT.

If the sample sizes are unequal, Bancroft (1968) suggests using the harmonic mean of the sample sizes
(reciprocal of the arithmetic mean of the reciprocals of the sample sizes):


$$n_h = t \Big/ \left[(1/n_1) + (1/n_2) + \cdots + (1/n_t)\right].$$

Kramer (1956) suggests replacing $s^2/n$ (the common variance of the sample means) in Equations (3.3), (3.4a),
and (3.6b) by the average of $s^2/n_i$ and $s^2/n_j$, the variances of the two sample means being tested. Equation
(3.6b) becomes


$$R_p = q(\alpha_p; p, \nu)\,\sqrt{s^2\,[(1/n_i) + (1/n_j)]/2}. \qquad (3.6c)$$
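
Both adjustments for unequal sample sizes are simple to compute. In the sketch below the replication numbers are hypothetical, and `kramer_range` is our own name for the yardstick of Equation (3.6c).

```python
import math
from statistics import harmonic_mean

def kramer_range(q_crit: float, s2: float, n_i: int, n_j: int) -> float:
    """Yardstick with s2/n replaced by the average of s2/n_i and s2/n_j."""
    return q_crit * math.sqrt(s2 * (1.0 / n_i + 1.0 / n_j) / 2.0)

reps = [4, 6, 6, 5, 6, 6, 3]                        # hypothetical unequal replications
print(round(harmonic_mean(reps), 3))                # single "effective" n in Bancroft's suggestion
print(round(kramer_range(2.89, 79.64, 4, 6), 2))    # pair with 4 and 6 replications
print(round(kramer_range(2.89, 79.64, 6, 6), 2))    # equal n: same as q * sqrt(s2 / n)
```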


Kramer (1957) extends the procedure in an obvious manner to correlated as well as heteroscedastic
means, where the variance of $\bar{y}_i$ is $c_{ii}\sigma^2$, that of $\bar{y}_j$ is $c_{jj}\sigma^2$, and their covariance is
$c_{ij}\sigma^2$. The coefficients $c_{ii}$, $c_{jj}$, and $c_{ij}$ are known, but $\sigma^2$ is unknown and is estimated as usual
by the error mean square with, say, $\nu$ d.f. from the analysis of variance. (This does not handle the situation
where the unequal variances of the means are due to observations from the different treatments having unequal
variances. The correlation between the means may be due to an incomplete block design or a covariate being used
in the analysis.) If $\bar{y}_i$ and $\bar{y}_j$ are the extremes of p ranked treatments, then we declare these
treatments to be different if their difference exceeds


$$q(\alpha_p; p, \nu)\,\sqrt{(c_{ii} - 2c_{ij} + c_{jj})\,s^2/2} \qquad (3.6d)$$


in Duncan’s test, and similarly for the Newman-Keuls or the Tukey tests. Note that if the means are
uncorrelated, $c_{ij}$ = 0, $c_{ii} = 1/n_i$, and $c_{jj} = 1/n_j$, so that Equation (3.6d) reduces to (3.6c).

Kramer’s extension of the test to correlated and heteroscedastic means is approximate; and it is also 
conservative, in the sense that it tends to declare two means equal when they are not. Duncan (1957) proposes 
a more powerful test, which imposes a further condition for a subset of means to be declared homogeneous. 


3.6.2. Bayesian k-ratio t (LSD) Rule


In Fisher’s protected LSD method, the result of the overall F test for treatment effects is used only in a
go, no-go fashion. In Duncan’s Bayesian k-ratio t or k-ratio LSD rule, the observed value of the F test statistic
actually is used in calculating the LSD or the critical t value for comparing two means. If the F ratio is large 
(indicating heterogeneous treatments), the critical t value is reduced, thereby increasing the power of the 
test; and if the F ratio is small (indicating homogeneous or nearly homogeneous treatments), the critical t 
value is increased, making it more difficult to declare two treatments to be significantly different and thus 
decreasing Type I error probability. Duncan (1975) summarizes his earlier work (1961 and 1965) and that of 
his former doctoral students (Ray A. Waller and Dennis O. Dixon) at The Johns Hopkins University in 1969 
and 1974. 

The k-ratio t test is based on an EBALEP (empirical Bayes, additive losses, exchangeable priors)
approach. The sample mean $\bar{y}_i$ is, of course, a random variable, usually assumed to be normally distributed
with mean $\mu_i$ and variance $\sigma^2/n$. In Bayesian statistical inference, the population means
$\mu_1, \mu_2, \ldots, \mu_t$ also are regarded as random variables, with a prior distribution that usually is assumed
to be normal with some mean $\mu_0$ and variance $\sigma_\mu^2$. (This may well be true experimentally and not merely
conceptually, if the t treatments correspond to t varieties, say, randomly selected for field testing from a
larger collection of varieties.) The term “empirical Bayes” comes about through having to use the data to
estimate the parameters of the conceptual superpopulation of populations. If $L_i$ is the loss incurred when the
i-th decision is erroneous, and similarly with $L_j$, the additive losses assumption states that the loss incurred
is $L_i + L_j$ if both the i-th and the j-th decisions are incorrect. Finally, the exchangeable prior distributions
assumption states that a priori the comparisons are “equally plausible.” This rules out, for example, the case
where the t treatments form a p x q factorial (where a priori comparisons of main effects are more likely to be
significant than interaction effects) or where the t treatments correspond to t levels of a quantitative factor,
where we may a priori expect an ordering of the true treatment means $\mu_1 \le \mu_2 \le \ldots \le \mu_t$. (Of course,
these two cases fall under Ch. 2, and no multiple comparison technique is appropriate.)

A novel feature of the test is the use of the ratio (denoted by k) of the relative seriousness of Type I to
Type II errors. By considering the case of t = 2 treatments (where no multiple comparison problem exists),
the critical value in the regular Student’s t test at a given $\alpha$ level can be made approximately equal to that in
the k-ratio t test for some value of k. In round figures, the approximate correspondence between $\alpha$ and k is:


$\alpha$:  .10,  .05,  .01
k:          50,  100,  500.

Therefore, Duncan recommends that k be taken to be equal to 100 or 500 when an experimenter previously
would have tested at the 5% or the 1% level, respectively.

Any difference d between two means or, more generally, any contrast c among the means is significantly
different from zero if the ratio $d/s_d$ or $c/s_c$ exceeds some critical value t(k, F, t, $\nu$), where
$s_d = \sqrt{2s^2/n}$, $s^2$ is the error mean square with $\nu$ degrees of freedom, n is the constant number of
replications of each treatment, and F is the observed F ratio for treatments from the analysis of variance
table. (The estimated variance $s_c^2$ of a contrast is given in Equation (3.5a).) As indicated above, the critical
t value depends on the four arguments k, F, t, and $\nu$. (Unfortunately, we have used the same letter t to denote
two entirely different things: the total number of treatments in the experiment and the t test or
distribution.) Its dependence on F is awkward for tabulation because of the uncountably infinite number of
values that F can take, making interpolation almost inevitable in each application. There is also no easy or
explicit formula for calculating the critical value. It is the solution of an extremely complicated integral
equation, which appears as Equation (3.15) in Duncan (1975). Table D in the appendix gives the critical values
for the k-ratio t test for k = 100 and 500, taken from Waller and Duncan (1972). For interpolating with respect
to F, Waller and Duncan (1969) recommend linear interpolation using $a = 1/\sqrt{F}$ for F < 2.4, except when
q > 100 and $\nu$ > 60; otherwise, we use $b = \sqrt{F/(F-1)}$, for F > 2.4, except when q < 20 and $\nu$ < 20, where
q = t - 1. When a cannot be used, b is used, and vice versa. Interpolation with respect to q and $\nu$ should
hardly ever be necessary. If needed, the recommendation is to interpolate using q and $1/\nu$. Values of a and b
are included in Table D.

For large experiments (large number t of treatments and large number $\nu$ of d.f. for error), the critical
values may be approximated as follows, with b already defined above:


$$t(100, F, \infty, \infty) = 1.72\,b \quad (\text{for } k = 100)$$


$$t(500, F, \infty, \infty) = 2.23\,b \quad (\text{for } k = 500) \qquad (3.6e)$$


Duncan (1965) considers Equation (3.6e) to give adequate approximation if $t \ge 15$ and $\nu \ge 30$. Equation (3.6e)
shows that for large F (sign of heterogeneous treatments), two means will be declared different if their
studentized difference ($d/s_d$) exceeds only 1.72 (for k = 100, corresponding to $\alpha$ = .05), while for a small F =
1.5, say, the critical value is raised to $1.72\sqrt{1.5/.5}$ = 2.98, reducing the probability of Type I error.
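
How the large-sample critical value moves with the observed F can be seen from a short sketch of Equation (3.6e), taking b as the square root of F/(F - 1). The function name is ours, and the approximation is meant only for large t and large error d.f., as stated above.

```python
import math

LARGE_SAMPLE_CONSTANT = {100: 1.72, 500: 2.23}    # limiting values quoted in Equation (3.6e)

def k_ratio_t_large_sample(k: int, f_ratio: float) -> float:
    """Approximate critical t: constant * sqrt(F / (F - 1)); requires F > 1."""
    if f_ratio <= 1.0:
        raise ValueError("the approximation requires an observed F greater than 1")
    return LARGE_SAMPLE_CONSTANT[k] * math.sqrt(f_ratio / (f_ratio - 1.0))

for f in (1.5, 4.61, 10.0, 100.0):
    print(f, round(k_ratio_t_large_sample(100, f), 2))
```

As F grows, the critical value falls toward 1.72 (for k = 100), which is the behavior described in the text.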

In the numerical example we have been considering, t = 7 treatments, error mean square $s^2$ = 79.64 with
$\nu$ = 30 degrees of freedom, F = 4.61, and standard error of a difference $s_d = \sqrt{2s^2/n} = \sqrt{2(79.64)/6}$ = 5.15.
For k = 100, q = t - 1 = 6, and $\nu$ = 30, Table D gives t = 2.16 for F = 4.0 (and b = 1.155) and t = 2.02 for F = 6.0
(and b = 1.095). Interpolating for F = 4.61 (and $b = \sqrt{4.61/3.61}$ = 1.130), we get the critical t value as t(100,
4.61, 7, 30) = 2.02 + (2.16 - 2.02)(1.130 - 1.095)/(1.155 - 1.095) = 2.02 + .08 = 2.10. (If we had interpolated
directly with respect to F, instead of the recommended $b = \sqrt{F/(F-1)}$, the calculated value of t would be 2.12.
Although t = 7 is too small to be regarded as infinite, use of Equation (3.6e) gives a calculated t of
$1.72\sqrt{4.61/3.61}$ = 1.72 (1.13) = 1.94.) Instead of dividing each difference by its standard error $s_d$ and
comparing it with the k-ratio t value, it will be more convenient computationally to multiply the t value by $s_d$
to give the corresponding k-ratio LSD = 2.10 (5.15) = 10.82 for the present problem. Any two means differing
by more than 10.82 will be declared different. The results are as follows, being identical to those obtained by
using Fisher’s LSD method.


49.6  58.1  61.0  61.5  67.6  71.2  71.3
  A     B     C     D     E     F     G
----------
      ----------------------
            ----------------------------
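
The interpolation in b can be written out as follows. This is a sketch with our own function names; the two tabulated entries (F = 4.0, t = 2.16) and (F = 6.0, t = 2.02) are those quoted from Table D above.

```python
import math

def b_of(f_ratio: float) -> float:
    """Interpolation variable b = sqrt(F / (F - 1))."""
    return math.sqrt(f_ratio / (f_ratio - 1.0))

def interpolate_k_ratio_t(f_obs, f_lo, t_lo, f_hi, t_hi):
    """Linear interpolation in b between two tabulated entries (f_lo, t_lo) and (f_hi, t_hi)."""
    b, b_lo, b_hi = b_of(f_obs), b_of(f_lo), b_of(f_hi)
    return t_hi + (t_lo - t_hi) * (b - b_hi) / (b_lo - b_hi)

t_crit = interpolate_k_ratio_t(4.61, 4.0, 2.16, 6.0, 2.02)
s_d = math.sqrt(2 * 79.64 / 6)          # standard error of a difference
k_ratio_lsd = t_crit * s_d
print(round(t_crit, 2), round(k_ratio_lsd, 2))
```

Carrying unrounded intermediate values gives an LSD slightly above the 10.82 obtained in the text from the rounded factors 2.10 and 5.15; the conclusions about the seven varieties are unaffected.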


The LSD’s (in multiples of $s_d$) from the procedures for 7 treatments and 30 d.f. for error are:


                                                        LSD/s_d

Fisher’s                                                2.04
Newman-Keuls’ MRT ($q(\alpha; p, \nu)/\sqrt{2}$)               2.04, 2.47, ..., 3.15
Tukey’s HSD                                             3.15
Tukey’s MRT                                             2.60-3.15
Scheffé’s                                               3.81
Duncan’s MRT ($q(\alpha_p; p, \nu)/\sqrt{2}$)                  2.04, 2.15, ..., 2.33
Duncan’s k-ratio t test (for an observed F = 4.61)      2.10

This tabulation shows that Duncan’s k-ratio LSD rule is almost as powerful as Fisher’s LSD, without the
latter’s higher Type I error probability, for if $H_0$ were true, the observed F would have been smaller (equal to
2.4, say) and from Table D, the critical value for t would have been 2.42. If the treatments are very 
heterogeneous, the k-ratio LSD rule can be more powerful than Fisher’s LSD. If F = 10, for example, the 
critical k-ratio t value is 1.93, compared to 2.04 for Fisher’s LSD rule. 
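
Most entries of the tabulation can be recomputed from the critical values used earlier in this chapter, since each yardstick divided by $s_d = \sqrt{2s^2/n}$ no longer depends on the data. The sketch below is illustrative; the Duncan entry for p = 4 (3.12) is an assumed value that should be checked against Table C.

```python
import math

root2 = math.sqrt(2.0)
t_treat, f_crit = 7, 2.42                                       # treatments, F(.05; 6, 30)
q_nk = {2: 2.89, 3: 3.49, 4: 3.85, 5: 4.10, 6: 4.30, 7: 4.46}   # q(.05; p, 30)
q_duncan = {2: 2.89, 3: 3.04, 4: 3.12, 5: 3.20, 6: 3.25, 7: 3.29}   # p = 4 entry assumed

table = {
    "Fisher's LSD": [2.04],
    "Newman-Keuls' MRT": [q / root2 for q in q_nk.values()],
    "Tukey's HSD": [q_nk[t_treat] / root2],
    "Tukey's MRT": [0.5 * (q + q_nk[t_treat]) / root2 for q in q_nk.values()],
    "Scheffe's": [math.sqrt((t_treat - 1) * f_crit)],
    "Duncan's MRT": [q / root2 for q in q_duncan.values()],
}
for method, values in table.items():
    print(method, [round(v, 2) for v in values])
```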

The k-ratio t test is adaptable for simultaneous interval estimation. Following Fisher’s, Scheffé’s, and
Tukey’s methods, one would expect the k-ratio confidence interval for $\delta = (\mu_i - \mu_j)$ to be
$(d = \bar{y}_i - \bar{y}_j) \pm$ k-ratio LSD, where the LSD = t(k, F, t, $\nu$) $s_d$, but this is not so. Besides the four
parameters k, F, t, $\nu$, the LSD in the interval estimation problem also depends on the observed value of
t = $d/s_d$. Unfortunately, tables are not available at present. We refer the reader to Duncan (1975) and Dixon
and Duncan (1975) for details. A large sample solution for the limits is as follows:


$$[\delta_L, \delta_U] = [1 - (1/F)]\,d \pm \sqrt{1 - (1/F)}\;s_d\,t(k, \infty, \infty, \infty), \qquad (3.6f)$$


where t = 1.72 (for k = 100) and 2.23 (for k = 500). Note that the point estimate of $\delta = (\mu_i - \mu_j)$ is
$[1 - (1/F)](\bar{y}_i - \bar{y}_j)$. Dixon and Duncan (1975) think that the preceding large sample approximation is
adequate if t > 16, $\nu \ge 60$, and $F \ge 6$.

Another approximation that assumes only a large observed F value (with finite t and $\nu$) is the following:


$$[\delta_L, \delta_U] = d \pm s_d\,t(k, \infty, t, \nu). \qquad (3.6g)$$


The values of $t(k, \infty, t, \nu)$ are independent of t and are obtainable from the last row in Table D in the
appendix for k = 100 and 500.


3.7. Studentized Maximum Modulus Procedure 


All the procedures so far discussed for simultaneous interval estimation are for contrasts among the t
means (or paired differences in particular). Sometimes, the experimenter may wish to construct simultaneous
confidence intervals for the population means themselves. Assume that all the sample means
$\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_t$ are correlated equally with correlation coefficient $\rho$ and with possibly unequal
variances $d_1\sigma^2, d_2\sigma^2, \ldots, d_t\sigma^2$, where the d’s are known constants. If $s^2$ is the usual unbiased
estimate of $\sigma^2$ with $\nu$ degrees of freedom, the probability is $\gamma = (1 - \alpha)$ that $\mu_i$ lies within
$\bar{y}_i \pm u(t, \nu, \rho; \gamma)\sqrt{d_i}\,s$ for all i = 1, 2, ..., t simultaneously, where $u(t, \nu, \rho; \gamma)$ is the
two-sided $100\gamma\%$ point of the maximum absolute value of the t-variate Student’s t distribution with $\nu$
degrees of freedom and common correlation $\rho$. (Constructing a $100\gamma^{1/t}\%$ confidence interval for $\mu_i$
independently of the others, using data from the i-th sample only, is not efficient.)

This technique can be extended to linear combinations of the means (not necessarily contrasts). The
probability is $(1 - \alpha)$ that $\sum c_i\mu_i$ lies within $\sum c_i\bar{y}_i \pm u(t, \nu, \rho; \gamma)\,s \sum |c_i|\sqrt{d_i}$ for all
(uncountably infinite) sets of constants $(c_1, c_2, \ldots, c_t)$. Values of $u(t, \nu, \rho; \gamma)$ are given in Hahn and
Hendrickson (1971) for $\rho$ = 0, .2, .4, .5; $\gamma$ = .90, .95, .99; t = 1(1)6(2)12, 15, 20; $\nu$ = 3(1)12, 15(5)30, 40, 60.
Table E in the appendix gives the values of $u(t, \nu, 0; \gamma)$. Use of Table E in cases where $\rho \ne 0$ gives
conservative results. The values for $\rho \ne 0$ are smaller than corresponding ones with $\rho$ = 0.


3.8. Comparisons Against a Control


3.8.1. Dunnett's Method 


In experiments comparing t treatments, one of the treatments quite often is a control (check or 
untreated). In these experiments, we could partition the (t—1) d.f. for treatments into 1 d.f. for comparing 
control against the average of the other treatments and (t —2) d.f. for comparisons among the (t—1) “real” 
treatments. If these (t—1) other treatments are significantly different, the 1 d.f. comparison between their 
average and the control may not be meaningful. The experimenter may wish to compare the control with each 
of the other (t —1) treatments (and not with their average). Duncan’s k-ratio t test is not applicable here since 
the exchangeable priors (or equally plausible comparisons) assumption is not satisfied. (The difference 
between a control and a treatment is a priori likely to be larger than that between two treatments.) Dunnett
(1955) gives a procedure for the simultaneous interval estimation or multiple comparisons of the control with
each of the others, with an experimentwise error rate. A treatment and a control are declared different if
their means differ by more than $t(\alpha; q, \nu)\,s_d$, where $s_d$ is the standard error of a difference and q = (t-1) is the
number of treatments other than control. Values of $t(\alpha; q, \nu)$ are given in Dunnett (1964) and reproduced for
both one-sided and two-sided tests in Table F of the appendix. If we are comparing insecticides, for example,
and the control is a standard one, two-sided tests would be proper since we do not know a priori if the new
insecticides would be better or worse than the standard insecticide. More extensive tables of $\sqrt{2}\,t(\alpha; q, \nu)$ for
one-sided tests are given in Gupta and Sobel (1957) for up to 50 treatments.

To illustrate the method, suppose that variety A in our numerical example is a standard variety, thus 
calling for two-sided tests of A against each of the others. From Table F2, with 30 d.f. for error and q = 6 other
treatments besides control, the critical t value in a 5% two-sided test is t(.05; 6, 30) = 2.72. The standard error
of a difference is $s_d = \sqrt{2s^2/n} = 5.15$. The LSD between control and each of the others is LSD = 2.72 (5.15) =
14.0. Since the mean of A is 49.6, any variety will be different from A if its mean is at least 49.6 + 14.0 = 63.6.
The result is that B, C, and D are not different from A, but E, F, G are better than A. The two-sided interval
estimate of the difference between a standard variety and any other variety is their observed mean difference
$\pm$ 14.0.
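
The arithmetic of this example can be retraced with a short script. This is only a sketch: the variable names are ours, and the critical value 2.72 is copied from the text (Table F) rather than computed.

```python
import math

# Variety means from the numerical example; A is the standard (control).
means = {"A": 49.6, "B": 58.1, "C": 61.0, "D": 61.5,
         "E": 67.6, "F": 71.2, "G": 71.3}
s2, n = 79.64, 6      # error mean square (30 d.f.) and replications
t_crit = 2.72         # two-sided t(.05; 6, 30), copied from the text

s_d = math.sqrt(2 * s2 / n)   # standard error of a difference
lsd = t_crit * s_d            # Dunnett's allowance

control = means["A"]
different = [v for v, m in means.items()
             if v != "A" and abs(m - control) > lsd]

print(f"s_d = {s_d:.2f}, allowance = {lsd:.1f}")   # s_d = 5.15, allowance = 14.0
print("different from A:", different)              # ['E', 'F', 'G']
```

Each two-sided interval for a difference from A is then the observed difference plus or minus `lsd`.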

The preceding discussion assumes equal replications. If the control is replicated $n_0$ times and the i-th
treatment is replicated $n_i$ times, we define $s_d = \sqrt{s^2[(1/n_0) + (1/n_i)]}$, which reduces to the previous definition if
all replications are equal. More generally, if within treatment variances are not homogeneous, we define $s_d =
\sqrt{(s_0^2/n_0) + (s_i^2/n_i)}$ and use Satterthwaite's result for getting the d.f. of a linear combination of mean squares. It
may suffice to calculate only two error mean squares, one for within control and the other for within other
treatments. For a refinement, see Dunnett (1964). Dunnett's paper also gives the following optimal allocation
of experimental units. If $n_1 = n_2 = \ldots = n_{t-1} = n$, say, we should take $n_0 = n\sqrt{t-1}$. Bechhofer (1969)
generalizes this result to the case where the variances are unequal but their ratios $\sigma_i^2/\sigma_0^2$ (i = 1, 2, ..., t-1)
are known.
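
The quantities in this paragraph are easy to compute directly. The helper names below are hypothetical, and the degrees-of-freedom function uses the usual form of Satterthwaite's approximation for a linear combination of two mean squares; treat it as a sketch rather than the exact computation intended in Dunnett (1964).

```python
import math

def se_diff(s2, n0, ni):
    """Standard error of (treatment - control) with a pooled error mean square."""
    return math.sqrt(s2 * (1 / n0 + 1 / ni))

def se_diff_unequal(s2_0, n0, s2_i, ni):
    """Standard error when control and treatment have separate mean squares."""
    return math.sqrt(s2_0 / n0 + s2_i / ni)

def satterthwaite_df(s2_0, n0, df0, s2_i, ni, dfi):
    """Approximate d.f. of s2_0/n0 + s2_i/ni (Satterthwaite's approximation)."""
    a, b = s2_0 / n0, s2_i / ni
    return (a + b) ** 2 / (a ** 2 / df0 + b ** 2 / dfi)

def control_reps(n, t):
    """Control replications n*sqrt(t-1) when each of the other t-1 treatments has n."""
    return n * math.sqrt(t - 1)

print(round(se_diff(79.64, 6, 6), 2))   # equal replications: 5.15, as before
print(control_reps(6, 5))               # 12.0 control plots for t = 5, n = 6
```

With equal mean squares, replications, and d.f. in both groups, `satterthwaite_df` returns the sum of the two d.f., as it should.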


Robson (1961) extends Dunnett’s procedure to the case of a balanced incomplete block design, giving rise 
to correlated treatment means. 


3.8.2 Gupta and Sobel’s Method 


Using the statistic in Dunnett’s method, Gupta and Sobel (1958) give the following procedure for 
selecting all treatments that are as good as or better than the control or standard treatment. The procedure 
guarantees a probability of at least $(1-\alpha)$ that the selected subset of treatments contains all treatments that
are at least as good as the control. The rule is to include in the subset all treatments whose means $\bar{y}_i$,
compared with that of the control $\bar{y}_0$, satisfy


$(\bar{y}_i - \bar{y}_0) \ge -t(\alpha; q, \nu)\,s_d$,    (3.8a)


where $t(\alpha; q, \nu)$ is the one-sided critical value in Dunnett's test.

In using Equation (3.8a) as the criterion, we throw away treatments that are significantly worse than 
control. Treatments whose sample means are slightly less than that of control (so that $\bar{y}_i - \bar{y}_0$ will be slightly
negative) will be included in the subset. If we use Dunnett's test as a screening procedure, we declare the
i-th treatment to be as good as or better than control if


$(\bar{y}_i - \bar{y}_0) \ge +t(\alpha; q, \nu)\,s_d$.    (3.8b)


Comparing Equations (3.8a) and (3.8b), it is obvious that Gupta and Sobel's procedure will give a larger
subset of treatments. Dunnett’s method retains only those treatments that have proved themselves superior 
to control, while Gupta and Sobel’s method discards only those treatments that have proved inferior to 
standard treatment. 
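
To make the contrast between the two rules concrete, the sketch below applies both to the seven variety means of the running example, treating A as the control. This comparison is not worked in the text; the one-sided critical value 2.40 is the figure quoted for t(.05; 6, 30) in Section 3.9.7, and the variable names are ours.

```python
import math

means = {"A": 49.6, "B": 58.1, "C": 61.0, "D": 61.5,
         "E": 67.6, "F": 71.2, "G": 71.3}
s_d = math.sqrt(2 * 79.64 / 6)   # standard error of a difference
t_one = 2.40                     # one-sided t(.05; 6, 30), quoted in Section 3.9.7
allowance = t_one * s_d
control = means["A"]

# Rule (3.8a): keep everything not shown to be worse than control.
gupta_sobel = [v for v, m in means.items()
               if v != "A" and m - control >= -allowance]
# Rule (3.8b): keep only what has been shown to be better than control.
dunnett = [v for v, m in means.items()
           if v != "A" and m - control >= allowance]

print(gupta_sobel)   # ['B', 'C', 'D', 'E', 'F', 'G']
print(dunnett)       # ['E', 'F', 'G']
```

The Dunnett subset is always contained in the Gupta and Sobel subset, since the two cutoffs differ only in the sign of the allowance.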

Gupta and Sobel (1958) also discuss other related problems—comparing variances and binomial parame- 
ters. 

Sobel and Tong (1971) consider the optimal allocation of observations for partitioning a set of normal
populations in comparison with a control.


3.8.3 Williams’ Method


Williams (1971) considers the case where the t treatments are t levels or doses of some substance, with 
the control corresponding to zero dose. This situation was discussed in Section 2.3.1, where the recommended 
analysis was either to compare zero against the average of the nonzero doses and fit a regression to the q = 
(t—1) nonzero levels or to fit a curve through all t doses (including zero). Williams claims there are 
circumstances in which the experimenter may not wish to fit a curve to the t doses. He may wish, instead, to 
compare zero dose against each of the other doses. As an example, he cites toxicity studies in which the aim of 
the experiment may be to determine the lowest dose at which there is activity. (The assumption is that the 
response is zero up to this “lowest dose” and increases thereafter, instead of continuously increasing from 
zero, slowly at first and more rapidly afterwards.) Another reason for not wishing to fit a curve may be the 
experimenter’s unwillingness to assume a particular form (logistic, etc.) for the response function. The 
number of levels is usually very small (3 to 5), making model fitting rather difficult.

Dunnett’s procedure may be used to compare zero with the other doses, but some power is lost in not
making use of the structure in the treatments. Williams assumes a nondecreasing response function so that
$\mu_0 \le \mu_1 \le \ldots \le \mu_q$ if the treatments $T_0, T_1, T_2, \ldots, T_q$ are in increasing order of dosages. (If, say, the third
dose (i.e., second nonzero dose) is the level at which activity first becomes noticeable, we have
$\mu_0 = \mu_1 < \mu_2 \le \ldots \le \mu_q$.) The first step in Williams’ test is to estimate $\mu_i$ (i = 0, 1, ..., q). Because of the constraints on the
$\mu$’s, $\mu_i$ is not necessarily estimated by $\bar{y}_i$, the sample mean. Bartholomew (1961) gives the following maximum
likelihood estimates of the $\mu$’s. If $\bar{y}_0 \le \bar{y}_1 \le \bar{y}_2 \le \ldots \le \bar{y}_q$, then $\hat{\mu}_i = \bar{y}_i$ (i.e., $\mu_i$ is estimated by $\bar{y}_i$). Otherwise,
there is at least one i for which $\bar{y}_i > \bar{y}_{i+1}$. We replace both $\bar{y}_i$ and $\bar{y}_{i+1}$ by their weighted average


$\bar{y}_{i,i+1} = (n_i\bar{y}_i + n_{i+1}\bar{y}_{i+1})/(n_i + n_{i+1})$,


where $n_i$ is the number of replications of treatment or dose i. We now have only q means $\bar{y}_0, \bar{y}_1, \ldots, \bar{y}_{i-1},
\bar{y}_{i,i+1}, \bar{y}_{i+2}, \ldots, \bar{y}_q$. If these means are in nondecreasing order, we stop and estimate $\mu_j$ by $\bar{y}_j$ (for j = 0, 1,
..., i-1, i+2, ..., q) and estimate both $\mu_i$ and $\mu_{i+1}$ by $\bar{y}_{i,i+1}$. Otherwise, we repeat the averaging process,
giving $\bar{y}_{i,i+1}$ a weight of $(n_i + n_{i+1})$. For instance, if $\bar{y}_{i,i+1} > \bar{y}_{i+2}$, we average them to give


$\bar{y}_{i,i+1,i+2} = [(n_i + n_{i+1})\bar{y}_{i,i+1} + n_{i+2}\bar{y}_{i+2}]/(n_i + n_{i+1} + n_{i+2})$


as the common estimate of $\mu_i$, $\mu_{i+1}$, and $\mu_{i+2}$, if the sample means are now in correct ascending order.

We now have the estimated population means $\hat{\mu}_0, \hat{\mu}_1, \ldots, \hat{\mu}_q$, where some of these may be equal, from
the averaging process. Assuming equal replications for all doses (including zero), we now test


$t_p = (\hat{\mu}_p - \bar{y}_0)/\sqrt{2s^2/n}$,    (3.8c)


taking p = q, q-1, ..., 1 in this order, stopping as soon as we get a nonsignificant result. We declare the
p-th nonzero dose to be different from control if $t_p$ above exceeds the critical value $t(\alpha; p, \nu)$, given in Table G
in the appendix. (Note that for simplicity of statistical distribution, we test $\hat{\mu}_p$ against the unadjusted sample
mean $\bar{y}_0$ and not against $\hat{\mu}_0$, even if $\mu_0$ is not estimated by $\bar{y}_0$.) Of course, we can apply the test in the following
alternative way. Declare $\mu_p$ and $\mu_0$ different if


$(\hat{\mu}_p - \bar{y}_0) > t(\alpha; p, \nu)\,s_d$.    (3.8d)
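

The averaging step used to estimate the ordered means can be sketched as follows. The code uses the standard pool-adjacent-violators scan (left to right, merging backwards whenever the order is violated), which is our choice of implementation rather than something prescribed by the text; it arrives at the same final estimates as repeated pairwise averaging. The function name is ours, and the seven means are those of the Williams (1971) example discussed in this section.

```python
def pool_adjacent(means, reps):
    """Amalgamate adjacent means until they are nondecreasing.

    Returns one estimate per dose; pooled doses share their weighted average.
    """
    blocks = []  # each block: [weighted mean, total replications, number of doses]
    for m, n in zip(means, reps):
        blocks.append([m, n, 1])
        # Merge backwards while the last two blocks are in the wrong order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2, k2 = blocks.pop()
            m1, n1, k1 = blocks.pop()
            blocks.append([(n1 * m1 + n2 * m2) / (n1 + n2), n1 + n2, k1 + k2])
    estimates = []
    for m, _, k in blocks:
        estimates.extend([m] * k)
    return estimates

y = [10.4, 9.9, 10.0, 10.6, 11.4, 11.9, 11.7]   # dose 0 (control) first
mu_hat = [round(v, 2) for v in pool_adjacent(y, [8] * 7)]
print(mu_hat)   # [10.1, 10.1, 10.1, 10.6, 11.4, 11.8, 11.8]
```

Unequal replications are handled by passing the actual counts in `reps`, which then act as the weights in each average.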


Williams (1971) gives an example of a randomized block experiment with 8 blocks and t = 7 doses (zero
and q = 6 nonzero doses), and an error mean square $s^2$ = 1.16 with $\nu$ = 42 d.f. The observed means are $\bar{y}_0$ =
10.4, $\bar{y}_1$ = 9.9, $\bar{y}_2$ = 10.0, $\bar{y}_3$ = 10.6, $\bar{y}_4$ = 11.4, $\bar{y}_5$ = 11.9, and $\bar{y}_6$ = 11.7. The effect of the substance in the
experiment, if anything, can only increase the mean of the response. Since $\bar{y}_0 > \bar{y}_1$, we average these to give
$\bar{y}_{0,1}$ = (10.4 + 9.9)/2 = 10.15, and because this average exceeds $\bar{y}_2$, we form the weighted average $\bar{y}_{0,1,2}$ = $(2\bar{y}_{0,1}
+ \bar{y}_2)/3$ = 10.1. Since $\bar{y}_5$ and $\bar{y}_6$ are not in the correct ascending order, we average them to give $\bar{y}_{5,6}$ = 11.8. We
thus have the following estimates of the population means.


$\hat{\mu}_0 = \hat{\mu}_1 = \hat{\mu}_2 = \bar{y}_{0,1,2} = 10.1$;  $\hat{\mu}_3 = \bar{y}_3 = 10.6$;  $\hat{\mu}_4 = \bar{y}_4 = 11.4$;
$\hat{\mu}_5 = \hat{\mu}_6 = \bar{y}_{5,6} = 11.8$.


The standard error of a difference is $s_d = \sqrt{2(1.16)/8}$ = .539. For a test at $\alpha$ = .05, Table G gives the following
critical values for 40 d.f.

p:                      6      5      4      3      2      1
t(.05; p, 40):        1.81   1.80   1.80   1.79   1.76   1.68
t(.05; p, 40) s_d:     .98    .97    .97    .96    .95    .91


Applying equation (3.8d),


$\hat{\mu}_6 - \bar{y}_0$ = 11.8 - 10.4 = 1.4 > .98; conclude $\mu_6 > \mu_0$.
$\hat{\mu}_5 - \bar{y}_0$ = 11.8 - 10.4 = 1.4 > .97; conclude $\mu_5 > \mu_0$.
$\hat{\mu}_4 - \bar{y}_0$ = 11.4 - 10.4 = 1.0 > .97; conclude $\mu_4 > \mu_0$.
$\hat{\mu}_3 - \bar{y}_0$ = 10.6 - 10.4 = 0.2 < .96; conclude $\mu_3 = \mu_2 = \mu_1 = \mu_0$.


The conclusion is that the fourth nonzero dose was the lowest dose at which response was observed. 
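
The step-down testing in this example can be written as a short loop. The sketch below takes the order-restricted estimates and the allowances t(.05; p, 40) s_d exactly as quoted in the four comparisons above; the names are ours, and allowances for p = 2 and p = 1 are omitted because testing stops before they are needed.

```python
mu_hat = [10.1, 10.1, 10.1, 10.6, 11.4, 11.8, 11.8]   # estimates for doses 0..6
y0 = 10.4                                              # unadjusted control mean
allowance = {6: 0.98, 5: 0.97, 4: 0.97, 3: 0.96}       # t(.05; p, 40) * s_d

lowest_active = None
for p in sorted(allowance, reverse=True):   # p = 6, 5, 4, 3
    if mu_hat[p] - y0 > allowance[p]:
        lowest_active = p                   # dose p differs from control
    else:
        break                               # stop at the first nonsignificant result

print(lowest_active)   # 4
```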
Williams (1972) extends the procedure to handle the case where the zero dose has a different (larger) 
number of replications than that of the nonzero levels, for both one-sided and two-sided tests. 
In general, we would recommend the regression approach of Section 2.3.1. Suppose we have the 
following results: 


Dose: 0 1 2 3 4 5 
Response: 5 7 10 15 25 40 


Using the present procedure, we may conclude that treatment is first effective at dose 3. The author would 
rather believe that the response is increasing continuously from dose 0, gradually at first and more rapidly at 
higher doses. We might fit a curve and estimate the lowest dose at which the response will be at least y* say. If 
higher doses are more expensive and cost is a consideration, we could adjust the response to a per dollar basis 
and estimate the dose that will produce the highest adjusted response. 


3.8.4 Sequential Methods 


See Dudewicz, Ramberg, and Chen (1975) for a two-stage procedure when variances are unequal and 
unknown, and Paulson (1962) for a sequential procedure, assuming equal variances. In the latter, inferior 
treatments are dropped at each stage. 


3.9 Miscellaneous Methods


In this section we shall discuss briefly various related techniques or merely cite their references. 


3.9.1 Bonferroni Procedure for Preselected Contrasts 


Tukey’s and Scheffé’s methods enable us to construct confidence intervals for an infinite number of linear 
contrasts among the t means so that the probability is $(1-\alpha)$ that they are all simultaneously true. Usually an
experimenter is only interested in a rather small subset of m contrasts, say. If these m contrasts are
preselected and not suggested by the data, Dunn (1961) recommends the usual method based on the Student’s
t distribution, as in Fisher’s unprotected LSD, to construct an interval for each contrast independently with
confidence coefficient $1-(\alpha/m)$, so that, from Bonferroni’s inequality, the overall or simultaneous confidence
level for all m contrasts is at least $(1-\alpha)$. Two-sided $(100\alpha/m)$% points of the t distribution are given in the
paper and reproduced in Table A in the appendix. In the notation of Section 3.5, the confidence interval for
each contrast is


$\hat{C} \pm t(\alpha/m; \nu)\sqrt{V(\hat{C})}$,    (3.9a)


where $t(\alpha/m; \nu)$ is the two-sided $(100\alpha/m)$% point of the t distribution with $\nu$ degrees of freedom. These
intervals often will be narrower than those given by Tukey’s or Scheffé’s methods. See also Schafer and 
MacReady (1975). 
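
When Table A is not at hand, the Bonferroni critical value can be obtained numerically. The sketch below uses only the standard library: it integrates the t density by Simpson's rule and inverts the distribution function by bisection. The function names are ours, and the choice of m = 5 contrasts with 30 error d.f. is purely illustrative.

```python
import math

def t_pdf(x, nu):
    """Density of Student's t with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))

def t_cdf(x, nu, steps=2000):
    """P(T <= x) for x >= 0, by Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, nu) + t_pdf(x, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    return 0.5 + total * h / 3

def t_two_sided(alpha, nu):
    """Two-sided 100*alpha % point: the c with P(|T| > c) = alpha."""
    lo, hi = 0.0, 100.0
    for _ in range(60):                 # bisection on the distribution function
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha, m, nu = 0.05, 5, 30              # hypothetical: 5 preselected contrasts
unadjusted = t_two_sided(alpha, nu)     # ordinary two-sided 5% point
bonferroni = t_two_sided(alpha / m, nu) # two-sided 1% point
print(round(unadjusted, 2), round(bonferroni, 2))   # 2.04 2.75
```

The Bonferroni multiplier grows only slowly with m, which is why these intervals are often narrower than Tukey’s or Scheffé’s when m is small.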


3.9.2 Gabriel’s Simultaneous Test Procedure (STP) 


Gabriel (1964, 1969a) gives a procedure for testing the homogeneity of the $(2^t - t - 1)$ subsets (with at
least two means) from a set of t means. Let P be any subset containing at least two treatments and $S_P^2$ be the
treatment sum of squares for those treatments in P. These treatments will be declared to be different if


$S_P^2 > (t-1)\,s^2 F(\alpha; t-1, \nu)$,    (3.9b)


where $s^2$ is the error mean square with $\nu$ d.f. from the analysis of variance of the complete data (with t
treatments), and $F(\alpha; t-1, \nu)$ is the upper $(100\alpha)$% point of the F distribution with (t-1) and $\nu$ d.f. Note that
the critical value of F in Equation (3.9b) is that for the complete data so that the righthand side is identical for
all subsets.

The error rate is experimentwise. If $H_0$ is true (all t means are equal), the probability is only $\alpha$ that one or
more of the $(2^t - t - 1)$ subsets will be declared incorrectly to be heterogeneous. The procedure also has the
following nice property. Any set containing a significant subset is itself significant. (However, the converse is
not necessarily true, and it is possible for a significant set to contain no significant proper subsets.) Because of
this property, it is not necessary to test all subsets. For example, if the set (A, B, C) is significant, the set (A,
B, C, D) will be significant; and if (E, F, G) is not significant, the subsets (E, F), (E, G), and (F, G) also will be
not significant.

The 1964 paper has a numerical example. Tukey’s HSD method, which is conservative compared with 
Newman-Keuls’ or Duncan’s multiple range tests, found two significant pairs. Gabriel’s STP and Scheffé’s 
test found all subsets of two means (i.e., all paired differences) to be not significant. Generally, a set P will be
declared significant by Gabriel’s STP if and only if some contrast involving only those means in P is judged
significant by Scheffé’s procedure.
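
As an illustration (not worked in the text), the sketch below applies criterion (3.9b) to every subset of the seven variety means from the running example. The value 2.42 for the upper 5% point of F with 6 and 30 d.f. is taken from standard F tables and is an assumption here; the names are ours.

```python
from itertools import combinations

means = {"A": 49.6, "B": 58.1, "C": 61.0, "D": 61.5,
         "E": 67.6, "F": 71.2, "G": 71.3}
n, s2 = 6, 79.64                  # replications and error mean square (30 d.f.)
t = len(means)
f_crit = 2.42                     # assumed upper 5% point of F(6, 30)
bound = (t - 1) * s2 * f_crit     # right-hand side of (3.9b), same for all subsets

def subset_ss(names):
    """Treatment sum of squares for the treatments in the subset."""
    vals = [means[v] for v in names]
    centre = sum(vals) / len(vals)
    return n * sum((v - centre) ** 2 for v in vals)

subsets = [c for r in range(2, t + 1) for c in combinations(means, r)]
significant = {c for c in subsets if subset_ss(c) > bound}
sig_pairs = sorted(c for c in significant if len(c) == 2)

print(len(subsets))   # 120 = 2**7 - 7 - 1
print(sig_pairs)      # [('A', 'F'), ('A', 'G')]
```

Because a subset's sum of squares can only grow when treatments are added, every set containing (A, F) or (A, G) is also in `significant`, which is the property noted above.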


3.9.3 Kurtz-Link-Tukey-Wallace Range Procedure 


The analysis of variance is based on sums of squares. For computational convenience, analogous 
procedures based on ranges are available. Kurtz, Link, Tukey, and Wallace (1965) give a similar shortcut 
procedure for multiple comparisons. This paper also has an interesting general discussion on the philosophy of 
multiple comparisons. 


3.9.4 Covariance Adjusted Means 


For multiple comparisons of adjusted treatment means in an analysis of covariance, see Kramer (1957), 
Halperin and Greenhouse (1958); Scheffé (1959, pp. 209-213); Bancroft (1968, Section 8.7); and Thigpen and 
Paulson (1974). 


3.9.5 Procedures for Two-Way Interactions 


Suppose that the t treatments are in the form of a p x q factorial, both factors being qualitative. The
partitioning of the (pq - 1) degrees of freedom for the t = pq treatments is discussed in Section 2.2. Harter (1970)
gives a procedure for comparing interaction effects of the form


$A_iB_u + A_jB_v - A_iB_v - A_jB_u = [(A_i - A_j)B_u] - [(A_i - A_j)B_v]$
$= [A_i(B_u - B_v)] - [A_j(B_u - B_v)]$,


where $A_iB_u$, for example, is the mean for the i-th level of factor A and the u-th level of factor B. The
preceding interaction is the difference between two differences; viz., (difference between the i-th and the
j-th levels of factor A, both at the u-th level of B) minus (difference between the i-th and the j-th levels of
A, both at the v-th level of B). As the second form of the expression shows, the interaction also can be written
as the difference between the u-th and the v-th levels of B at the i-th level of A minus the same difference at
the j-th level of A. See also Dunn and Massey (1965), Sen (1969), Johnson (1976), and Bradu and Gabriel
(1974). The last paper describes three methods for testing and simultaneous interval estimation.
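
A small numerical check may help show that the two ways of reading the interaction contrast agree. The four cell means below are hypothetical and chosen only for illustration.

```python
# Hypothetical cell means for levels i, j of factor A and levels u, v of factor B.
cell = {("i", "u"): 12.0, ("i", "v"): 9.0,
        ("j", "u"): 7.0,  ("j", "v"): 8.0}

# The interaction contrast as a signed sum of four cell means.
contrast = cell["i", "u"] + cell["j", "v"] - cell["i", "v"] - cell["j", "u"]

# (A_i - A_j) at level u of B, minus the same difference at level v.
by_a = (cell["i", "u"] - cell["j", "u"]) - (cell["i", "v"] - cell["j", "v"])

# (B_u - B_v) at level i of A, minus the same difference at level j.
by_b = (cell["i", "u"] - cell["i", "v"]) - (cell["j", "u"] - cell["j", "v"])

print(contrast, by_a, by_b)   # 4.0 4.0 4.0
```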


3.9.6 Nonparametric Methods 


In all the methods considered so far, we have assumed that the data are distributed normally. If we 
cannot or do not wish to make this assumption, we must resort to nonparametric methods for separating the 
means. See Steel (1959, 1961); Dunn (1964); Miller (1966, Chapter 4); Rhyne and Steel (1965, 1967); McDonald and
Thompson (1967); Tobach et al. (1967); Rizvi, Sobel, and Woodworth (1968); Sen (1969); Puri and Puri (1969);
Slivka (1970); and Hollander and Wolfe (1973, Sections 6.3, 7.3, and 7.7).


3.9.7 Gupta’s Random Subset Selection Procedure


In experiments where the scientist is looking for the best treatment (e.g., a plant breeder selecting a new
variety for highest yield or resistance to some disease), multiple comparison techniques are inappropriate.
We cited Gupta and Sobel (1958) in Section 3.8.2 for a method for selecting treatments that are as good as or
better than a control or standard treatment. Some selected references on problems of selecting the best out of
t treatments are Paulson (1964); Gupta (1965); Robbins, Sobel, and Starr (1968); Bechhofer, Kiefer, and Sobel
(1968); Sobel (1969); Tong (1970); Rizvi (1971); Chiu (1974a, 1974b); a review paper with 71 references by
Weatherill and Ofosu (1974); Wackerly (1975); Santner (1975); and Gupta and Panchapakesan (1971).

Selection problems may be posed in several ways, of which the following two are the most common. 

(a) Given $\delta^* > 0$ and $P^* < 1$, find a procedure that will, with probability of at least $P^*$, choose the
population with the largest mean if this mean exceeds the second largest mean by at least $\delta^*$.

(b) Given $1/t < P^* < 1$, find the smallest subset of the t treatments such that the probability is at least $P^*$
that the subset will contain the best population.

The preceding formulations are referred to as the “indifference zone” and the “random subset” approaches,
respectively. In (a), we are indifferent to all differences that are less than $\delta^*$; and in (b), the number of
treatments that are included in the subset is a random variable. Decision theoretic approaches (minimax,
Bayesian, etc.) are also possible.

Gupta (1965) gives the following random subset solution. Include the i-th treatment in the subset if its
sample mean $\bar{y}_i$ satisfies the condition


$\bar{y}_i \ge \bar{y}_{\max} - t(\alpha; t, \nu)\,s_d$,    (3.9c)


where $t(\alpha; t, \nu)$ is the one-sided critical value of Dunnett’s test statistic (Section 3.8.1). Values of $t(\alpha; t, \nu)$ are
given in Table F1 in the appendix, with t = (q+1); e.g., if t = 7, we look under q = (t-1) = 6.

In our numerical example, we have t = 7, $\nu$ = 30 d.f., $s_d = \sqrt{2s^2/n} = \sqrt{2(79.64)/6}$ = 5.15, and $\bar{y}_{\max}$ = 71.3.
Taking $\alpha = 1 - P^*$ = .05, the value of t(.05; 7, 30) from Table F1 with t = 7 (or q = 6) is 2.40. From Equation
(3.9c), we include in the subset all treatments whose means exceed 71.3 - (2.40)(5.15) = 71.3 - 12.36 = 58.94.
Thus, we are 95% confident that the set (C, D, E, F, G) will contain the best treatment (variety).
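
The selection rule is a one-line filter once the cutoff is known. A minimal sketch of the example above, with the critical value 2.40 copied from the text and variable names of our own choosing:

```python
import math

means = {"A": 49.6, "B": 58.1, "C": 61.0, "D": 61.5,
         "E": 67.6, "F": 71.2, "G": 71.3}
s_d = math.sqrt(2 * 79.64 / 6)   # standard error of a difference
t_crit = 2.40                    # one-sided value from Table F1, copied from the text

cutoff = max(means.values()) - t_crit * s_d
subset = [v for v, m in means.items() if m >= cutoff]

print(round(cutoff, 1))   # 58.9
print(subset)             # ['C', 'D', 'E', 'F', 'G']
```

The unrounded cutoff differs from the 58.94 in the text only in the second decimal, because the text rounds $s_d$ to 5.15 before multiplying.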


3.9.8 Scott and Knott's Cluster Analysis Method 


If a scientist has collected a mass of data (usually multivariate), he may wish to know if these came from 
one or more populations. If the latter, he would like to know into how many groups or clusters the data should
be divided, and the best way of forming these groups. (For a recent paper and book on cluster analysis, see 
Kuiper and Fisher (1975) and Hartigan (1975).) With univariate data, we can arrange the observations in 
ascending order. If the data are 10, 11, 55, 56, 59, for example, they can be divided into two clusters in an 
obvious manner, namely (10, 11) and (55, 56, 59). In less clearcut situations, an objective criterion for 
grouping is required. If we know that the data came from two populations only, we can form the two groups 
by maximizing the sum of squares between the two groups (or equivalently, such that the sum of the within 
groups sums of squares is a minimum). With t observations (or means), we need only consider the (t —1) 
possible partitions formed by dividing between two successive ordered means. The multiple range tests we 
have considered do, in fact, group the means, but they allow a particular mean to be in more than one group. 
Duncan’s test, for example, groups the means in the example into (A, B), (B, C, D, E), and (C, D, E, F, G). 
Tukey (1949) was the first to consider forming nonoverlapping clusters by looking at the gaps in the ordered 
means and testing their statistical significance, but he retracted this procedure in his 1953 manuscript 
(circulated privately) on the problem of multiple comparisons. 

Scott and Knott (1974) propose the following sequential partitioning and testing procedure. Arrange the
t = 7 means in ascending order, denoted by A, B, C, D, E, F, and G, respectively. Partition these into two
groups, using the above criterion. Suppose this results in (A, B, C, D) and (E, F, G) as the two groups. Now
test the null hypothesis $H_0$: $\mu_1 = \mu_2 = \ldots = \mu_7$ against the alternative hypothesis $H_1$: $\mu_i = m_1$ or $m_2$.
(Presumably, the overall F test with 6 and $\nu$ d.f. need not be performed. The usual F statistic tests $H_0$ against
the most general alternative that not all means are equal. The proposed procedure tests $H_0$ against the much
more specific alternative that all the means are either $m_1$ or $m_2$, with at least one mean in each group, and,
therefore, should be more powerful than the usual F test.) If $H_0$ is rejected, we partition (A, B, C, D) into two
groups and test the equality of these groups. The procedure is similar for (E, F, G). It is repeated until $H_0$ is
accepted.

The test is as follows. We assume that the t means $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_t$ are uncorrelated and homoscedastic,
which implies equal replications n, say. As usual, let $s^2$ be the estimate (with $\nu$ d.f.) of the common variance $\sigma^2$
of single observations. (In the completely randomized design, $\nu$ = t(n-1).) Suppose that the partitioning
criterion forms two groups with $t_1$ and $t_2 = (t - t_1)$ means. The groups $G_1$ and $G_2$ will contain $nt_1$ and $nt_2$ original
observations, respectively. Let $T_1$ be the sum of the $nt_1$ observations in $G_1$, and similarly for $T_2$. In the usual
analysis of variance computations, the between groups sum of squares is


$B_0 = [T_1^2/(nt_1)] + [T_2^2/(nt_2)] - [T^2/(nt)]$,    (3.9d)


where $T = (T_1 + T_2)$. Under the null hypothesis, the maximum likelihood estimate of $\sigma^2$ is


$\hat{\sigma}_0^2 = [n\sum_{i=1}^{t}(\bar{y}_i - \bar{y})^2 + \nu s^2]/(t + \nu)$,    (3.9e)


where $\bar{y} = (\bar{y}_1 + \ldots + \bar{y}_t)/t$.


The test statistic is

$\lambda = \pi(B_0/\hat{\sigma}_0^2)/[2(\pi - 2)] = 1.376\,(B_0/\hat{\sigma}_0^2)$.    (3.9f)


The 95% points for the distribution of $\lambda$ were obtained by simulation and were found to be approximated
adequately, for practical purposes, by the chi-square distribution with $\nu_0 = t/(\pi - 2) = t/(1.1416)$ d.f.

(The simulation also included the case with $\nu$ = 0, for which the 95% points of $\lambda$ were estimated to be 2.75,
6.60, 12.11, and 21.74 for t = 2, 5, 10, and 20, respectively. This shows that we can test the homogeneity of t
means, even when each mean is based on n = 1 replication. This is, of course, impossible with the usual F test
and its general alternative hypothesis since the error mean square has zero d.f. As mentioned earlier, the
present $\lambda$ test makes an extra assumption about the alternative hypothesis.)

In our numerical example, t = 7, n = 6, $s^2$ = 79.64 with $\nu$ = 30 d.f. (design being that of a randomized
block experiment). The means in ascending order were 49.6(A), 58.1(B), 61.0(C), 61.5(D), 67.6(E), 71.2(F),
and 71.3(G). To find the partition with the largest between groups sum of squares, we should try, theoretically,
the t - 1 = 6 possible partitions: (A, BCDEFG), (AB, CDEFG), (ABC, DEFG), (ABCD, EFG), (ABCDE, FG),
(ABCDEF, G). In practice, we need try two or three possibilities only. (With a computer it is easy enough to
try all (t-1) partitions.) In this example, (A, BCDEFG) and (ABCD, EFG) are the two most serious
candidates. It can be shown that (ABCD, EFG) is the optimum partition. Here, $t_1$ = 4, $t_2$ = 3, $T_1$ = 6(49.6 +
58.1 + 61.0 + 61.5) = 1381.2, $T_2$ = 1260.6, $T = T_1 + T_2$ = 2641.8, $\bar{y}$ = (49.6 + ... + 71.3)/7 = 62.9, and $\sum(\bar{y}_i -
\bar{y})^2$ = 370.04.

From Equations (3.9d) and (3.9e), $B_0 = (1381.2)^2/24 + (1260.6)^2/18 - (2641.8)^2/42$ = 1602.86 and $\hat{\sigma}_0^2$ =
[6(370.04) + 30(79.64)]/(7 + 30) = 124.58. From Equation (3.9f), the test statistic is $\lambda$ = 1.376 (1602.86/124.58)
= 17.70. Using the chi-square approximation with $\nu_0$ = t/1.1416 = 7/1.1416 = 6.1 d.f., the value 17.70 is
significant. (The 95% point of the chi-square distribution is 12.6 for 6 d.f. and 14.1 for 7 d.f.)

We next have to partition (ABCD) and (EFG). In partitioning (EFG), t is now equal to three. For t = 3
means, the optimum partition is at the larger of the two gaps, giving (E, FG) with $t_1$ = 1, $t_2$ = 2, $T_1$ = 405.6, $T_2$
= 855.0, $\sum(\bar{y}_i - \bar{y})^2$ = 8.8866, giving $\hat{\sigma}_0^2$ = [6(8.8866) + 30(79.64)]/33 = 74.02, $B_0 = (405.6)^2/6 + (855.0)^2/12 -
(1260.6)^2/18$ = 53.29, and $\lambda$ = (1.376)(53.29)/74.02 = 0.99, which is not significant. The significance of the
partition of (ABCD) into (A, BCD) is borderline. If we accept this as being significant, the final groupings are
A, BCD, and EFG, which is what inspection of the means would suggest.
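
One step of the procedure (find the best two-group split, then compute the test statistic) can be sketched as below. The function names are ours. Recomputing from the rounded means printed above gives a sum of squared deviations of about 367.0 rather than the 370.04 quoted, possibly because the text worked from unrounded data, so the statistic for the first split comes out near 17.8 instead of 17.70; the conclusion is the same.

```python
import math

K = math.pi / (2 * (math.pi - 2))   # multiplier in (3.9f), about 1.376

def best_split(means, n):
    """Return (k, B0): the split after the k-th ordered mean that maximises B0."""
    t, total = len(means), n * sum(means)
    best = (0, -1.0)
    for k in range(1, t):
        t1_sum = n * sum(means[:k])
        t2_sum = total - t1_sum
        b0 = (t1_sum ** 2 / (n * k) + t2_sum ** 2 / (n * (t - k))
              - total ** 2 / (n * t))
        if b0 > best[1]:
            best = (k, b0)
    return best

def scott_knott_step(means, n, s2, nu):
    """Best split, B0, lambda of (3.9f), and approximate chi-square d.f."""
    t = len(means)
    ybar = sum(means) / t
    sigma0_sq = (n * sum((y - ybar) ** 2 for y in means) + nu * s2) / (t + nu)
    k, b0 = best_split(means, n)
    return k, b0, K * b0 / sigma0_sq, t / (math.pi - 2)

ordered = [49.6, 58.1, 61.0, 61.5, 67.6, 71.2, 71.3]   # A to G, ascending
k, b0, lam, df = scott_knott_step(ordered, 6, 79.64, 30)
print(k, round(b0, 2), round(df, 1))    # 4 1602.86 6.1  -> (ABCD, EFG)

k3, b3, lam3, df3 = scott_knott_step(ordered[4:], 6, 79.64, 30)
print(k3, round(b3, 2), round(lam3, 2)) # 1 53.29 0.99   -> (E, FG), not significant
```

Repeating the step on each group whose split is judged significant reproduces the sequential partitioning described above.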

For another cluster analysis approach to multiple comparisons, see Jolliffe (1975). 


3.9.9 Multivariate Populations 


We have so far considered univariate populations only. Quite often, we may collect several kinds of 
measurements from each experimental unit. For example, in comparing t brands of chocolate cake mixes, we 
may evaluate the resulting cakes with respect to each of p characteristics (flavor, aroma, texture, moistness, 
etc.). As another example, we may compare t treatments (storage conditions) for degreening lemons and take
color measurements on each of p dates. We may (and sometimes do) carry out p separate univariate analyses
of variance, one for each of the p characteristics or dates, but we sacrifice some power in not making use of the
correlations among the p characteristics. There is also a problem with the overall significance level in making
p separate analyses. Preferably, we should perform one multivariate (p-dimensional) analysis of variance. If
the null hypothesis of equal mean vectors (each population mean is now a set of p numbers) is rejected, we now
have two different kinds of multiple comparison problems. With respect to which of the p characteristics do 
the populations differ? (In the preceding cake example, do the cakes differ in flavor only, in flavor and texture 
only, or in all p characteristics?) We do not, of course, have this problem in univariate (p = 1) situations. We 
have been considering the other kind of multiple comparisons in this report (viz., which populations differ
from which). These comparisons are discussed in Kramer (1972, Section 5.11), Gabriel (1968, 1969b),
Krishnaiah (1969), Miller (1966, Chapter 5), and Morrison (1967, Section 5.4).


3.9.10 Subset Selection Approach to Multiple Comparisons 


We mentioned in the last paragraph of Chapter 1 that hypothesis testing is usually almost totally 
irrelevant. Two treatments will be declared significantly different if they are sufficiently replicated. If two 
means are declared significantly different, many experimenters often are misled into thinking that the 
difference is of practical importance. Reading (1975) applies the indifference zone formulation of subset 
selection problems to multiple comparisons. The experimenter specifies three quantities: P (probability that
all decisions concerning pairwise means are correct, an experimentwise probability), $\delta_1$ (largest amount that
two populations can differ and still be considered practically the same), and $\delta^*$ (smallest amount by which two
population means must differ to be considered definitely different). The interval $(\delta_1, \delta^*)$ is the indifference
zone. If two treatments differ by an amount in this zone, the experimenter does not care whether the 
treatments are declared different or the same. Given these three quantities, Reading gives tables for the 
necessary sample size and the critical value that must be exceeded for the difference between two means to be 
declared significant. Unfortunately, at present, the tables go up to t = 4 treatments only and assume that $\sigma^2$
is known.


3.9.11 Other Parameters and Populations 


In this publication, we have been comparing, estimating, or selecting normal populations with respect to 
their means. We conclude this chapter by citing selected references to similar work for other parameters and 
other populations. 

(a) Variances of normal populations. See David (1956), Ryan (1960), Bechhofer (1968), and Levy (1975a, 
1975b) for multiple comparisons; Jensen and Jones (1969) for simultaneous interval estimation; Gupta 
(1965), Ofosu (1975), and Arvesen and McCabe (1975) for subset selection. 

(b) Various kinds of simultaneous prediction intervals. Hahn (1970, 1972). 

(c) Regression coefficients. Duncan (1970) for multiple comparisons, and Hahn and Hendrickson (1971) for 
simultaneous interval estimation. 

(d) Subset selection for normal population with the largest (or smallest) $\alpha$-quantile. Barlow and Gupta
(1969).

(e) Subset selection for normal population with the largest exceedance probability. Kappenman (1972)
gives a method for selecting the normal population with the highest $h_i = P(X_i > c)$, where $X_i \sim N(\mu_i, \sigma_i^2)$
and c is a given constant.

(f) Comparison of several independent treatment mean squares against a common error mean square. See 
Nair (1948); Hartley (1955); and David (1962, pages 155-156). 

(g) Subset selection for gamma populations. Gupta (1963). 

(h) Ranking and selection of binomial populations. Gupta and Sobel (1960), Ryan (1960), Taylor and David
(1962), Paulson (1967), Bland and Bratcher (1968), Hoel and Sobel (1972), and Leonard (1972).

(i) Multinomial populations. Goodman (1965) and Fienberg and Holland (1973) for simultaneous estima- 
tion; Bechhofer, Elmaghraby, and Morse (1959) for selection; and Gabriel (1966) for multiple compari- 
sons. 

(j) Subset selection for Poisson, negative binomial, and Fisher’s logarithmic distributions. Gupta and
Panchapakesan (1971).

(k) Multiple comparisons of regression functions. Spjøtvoll (1972).

(l) Multiple comparisons of logistic curves. Reiersøl (1961).

(m) Selection of best treatment in paired-comparison experiments. Trawinski and David (1963). 

(n) Ranking of main effects in analysis of variance, variances of normal populations, and correlation 
coefficients of bivariate normal distributions. Eaton (1967). 

(o) Interval estimation of a ranked parameter. Alam and Saxena (1974).

(p) Simultaneous interval estimation of contrasts among means of a multivariate normal population. 
Bhargava and Srivastava (1973). 

(q) Applications to multiple regression problems. Miller (1966), Morrison (1967, Section 3.6), Wynn and 
Bloomfield (1971), Hochberg and Quade (1975), and Tarone (1976). 


CHAPTER 4. CONCLUSION 


The findings from some Monte Carlo sampling studies that have been conducted to evaluate the relative 
performances of the various multiple comparison procedures are summarized in this chapter. Here, we 
assume that multiple comparisons are appropriate, ruling out situations covered in Chapter 2, where the 
proper statistical technique is the partitioning of the degrees of freedom for treatments into orthogonal 
contrasts. When it is not possible a priori to form meaningful orthogonal contrasts, it is assumed that the 
problem is really one of multiple comparisons and not of ranking and subset selection. A plant breeder who is 
interested in selecting a new variety should not be concerned with multiple comparisons of all possible pairs of 
varieties. 

Scheffé’s method is the most versatile. It allows unequal replications, correlated means from covariance 
adjustment, general contrasts (and not just paired comparisons), and simultaneous interval estimation. The 
penalty for this generality is reduced power (failure to detect true differences in testing and wide confidence 
intervals in interval estimation of differences between two means). Tukey’s HSD method also can handle 
general contrasts and interval estimation, but it requires equal replications and uncorrelated means. 
Duncan’s and Newman-Keuls’ multiple range tests are exact only for paired comparisons of uncorrelated 
means with equal replications and are not adaptable for interval estimation. The LSD easily can handle 
unequal replications, can be used for interval estimation, and can be extended in a simple and obvious manner 
to general contrasts. Duncan’s Bayesian k-ratio rule is too new to have found widespread acceptance by 
experimental scientists. Duncan is very enthusiastic about this procedure and, in a private communication, 
expressed the hope that his Biometrics 1975 paper “will mark the beginning of the end of all of the earlier 
(pre-1960) α-level multiple comparison procedures.” 

We refer the reader to Section 3.6.2, where we tabulate the LSD’s for the various procedures (in 
multiples of the standard error of the difference between two means). In ascending order, we have Fisher’s 
LSD, Duncan’s k-ratio rule, Duncan’s MRT, Newman-Keuls’ MRT, Tukey’s MRT, Tukey’s HSD, and 
Scheffé’s method. (Duncan’s k-ratio rule is data dependent. It may be more “reckless” than Fisher’s LSD or 
more conservative than Tukey’s HSD, depending on the observed value of the F ratio for treatments.) The 
above order is, therefore, in decreasing order of the number of paired comparisons that will be declared 
significant. If the objective is to find as many significantly different pairs as possible, Fisher’s LSD is best. 
The problem, however, is not this simple. 
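
To make the ordering concrete, the following sketch computes three of these critical differences, in multiples of the standard error of a difference, for t = 10 treatments at the 5% level with infinite error degrees of freedom. The three tabled constants are standard large-sample values supplied here as assumptions (they are not copied from Section 3.6.2), and the function name is illustrative only.

```python
import math

# Assumed large-sample (infinite error d.f.) table values at the 5% level:
Z_975 = 1.960          # two-sided 5% point of the standard normal (limit of Student's t)
Q_05_10_INF = 4.474    # studentized range q(.05; p = 10, nu = infinity)
CHI2_95_9 = 16.919     # upper 5% point of chi-square with t - 1 = 9 d.f.


def critical_multiples():
    """Critical differences as multiples of the standard error of a difference."""
    return {
        "Fisher LSD": Z_975,
        # HSD uses q times the standard error of a mean, i.e. q / sqrt(2) SEDs.
        "Tukey HSD": Q_05_10_INF / math.sqrt(2.0),
        # Scheffe uses sqrt((t - 1) F); with infinite error d.f., (t - 1) F is chi-square.
        "Scheffe": math.sqrt(CHI2_95_9),
    }


multiples = critical_multiples()
for name, value in sorted(multiples.items(), key=lambda item: item[1]):
    print(f"{name:10s} {value:.2f}")
```

The smaller the multiple, the more paired differences are declared significant, which is the sense in which the LSD heads the list.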

There are two main difficulties in assessing the relative merits of the multiple comparison procedures. 
“In testing a hypothesis involving a simple two-decision situation, such as that to which the Neyman-Pearson 
theory is directly applicable, one compares two competing test criteria by fixing the Type I errors to be the 
same for both and compare the two power curves. Unfortunately, multiple-comparison procedures do not 
pertain to a single simple two-decision situation, but are special cases of multiple-decision procedures. At 
present there is no generally acceptable analytical method of comparing, in a manner similar to that for the 
two-decision situation, two competing multiple-decision test criteria.” (Bancroft 1968, p. 105.) 

Another difficulty is due to the different error rates used. Tukey’s and Scheffé’s methods use an 
experimentwise error rate, while Fisher’s LSD adopts a comparisonwise error rate. The multiple range tests 
of Duncan and of Newman-Keuls use different error rates, both of which are neither experimentwise nor 
comparisonwise. Duncan’s k-ratio rule does not even use the concept of error rate; it uses the ratio of the 
relative seriousness of the two types of errors. 

Because of these difficulties, the procedures have been compared using Monte Carlo sampling methods 
only. There is a difficulty with such empirical sampling studies. It is easy to study the probability of Type I 
error (declaring two equal means to be unequal) because there is, of course, only one way in which t means can 
be equal. It is much more difficult to compare the probability of Type II error (declaring two unequal means to 
be equal), because t means can be unequal in many ways. They can be all unequal (equally spaced, clustered in 
two or more groups, etc.), all equal but one, etc. It is unlikely that one method will be best for all patterns of 
inequality. 

Balaam (1963) was the first to publish results of a sampling study. He considered only four means, each 
with five observations, in eighteen configurations: (0,0,0,0), (1,0,0,0), . . ., (6,0,0,0); (1,1,0,0), (2,1,0,0), . . ., 
(5,1,0,0); (2,2,0,0), (3,2,0,0), (4,2,0,0); (3,3,0,0), (4,2,1,0), and (4,4,1,0). Three procedures (LSD, Newman- 
Keuls’, and Duncan’s MRT) were compared, each with and without a significant preliminary F test. The 
Newman-Keuls’ procedure was found inferior. The LSD was superior to Duncan’s MRT, in both protected 
and unprotected cases, but the difference in performance was small in the protected case. 

Boardman and Moffitt (1971) compared five procedures (LSD, Scheffé’s, Tukey’s HSD, Newman-Keuls’ 
MRT, and Duncan’s MRT) for testing all possible pairs of means with respect to their Type I comparisonwise 
and experimentwise error rates. They carried out 30 sets of 10,000 sampling experiments with t = 2, 3, . . ., 
11 normal populations; samples of equal sizes n = 5, 10, and 15; and α = .05. 

For t = 10 treatments, and taking α = 5%, the Type I comparisonwise error rate for Duncan’s MRT is 
about 2.5%, .21% for Tukey’s, and .01% for Scheffé’s procedure, showing the conservativeness of the latter 
two procedures. 


On an experimentwise basis, the error rate in Tukey’s HSD and Newman-Keuls’ multiple range test 
remains constant at 5% as t increases from 2 to 10, while for Duncan’s MRT and Fisher’s LSD, it increases to 
38% and 63% respectively. For Scheffé’s procedure, it decreases from 5% to .23%, showing conservativeness 
of the Scheffé procedure for pairwise contrasts. Thus, with t = 10 populations with equal means (and (10 x 9)/2 
= 45 possible pairwise comparisons), there is a 38% probability that one or more of the 45 comparisons will be 
declared significantly different by Duncan’s procedure. 


In view of this rather high experimentwise probability, Gill (1973) recommends that Duncan’s procedure 
be discontinued. Of course, Gill has even stronger feelings against the LSD procedure. In defense of these 
two procedures, the comparison, rather than the experiment, is the basic unit for the comparisonwise 
adherents. One wrong conclusion will not affect the usefulness of the remaining 44 comparisons. On the other 
hand, the rationale of the experimentwise error rate philosophy is that one wrong comparison vitiates all of 
the remaining 44 comparisons. Thus, making one wrong conclusion is as serious as making 45 wrong 
judgments in the same experiment (is this reasonable, in most cases?). We have to ensure that all 45 
comparisons are correct, not without having to pay a high premium, of course. For example, in a cubic lattice 
design with t = 729 varieties, (Cochran and Cox 1957, page 423), it will be virtually impossible to ensure that 
all (729 x 728)/2 = 265,356 paired comparisons will be judged correctly. 
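
The two counts of paired comparisons quoted above follow from t(t − 1)/2. A minimal check, with an illustrative function name:

```python
from math import comb


def n_pairs(t):
    """Number of distinct paired comparisons among t treatment means."""
    return comb(t, 2)  # same as t * (t - 1) // 2


pairs_10 = n_pairs(10)
pairs_729 = n_pairs(729)
print(pairs_10, pairs_729)
```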


Because of the independence of the validity of the individual comparisons (in the comparisonwise school), 
we can “afford” one wrong comparison out of 45. After all, in a 5% test, there is a one in 20 chance of an 
incorrect rejection so that out of 45 comparisons we should expect and tolerate about two wrong conclusions. 
In addition to the probability of one or more wrong rejections out of 45, it will be interesting to know also the 
probability of two or more wrong rejections. If the probability of two or more incorrect conclusions is 
considerably lower than that of one or more wrong conclusions, this should remove much of Gill’s objections to 
Duncan’s MRT and Fisher’s LSD procedures. 
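
The question raised here can be explored with a small sampling experiment. The sketch below is not a reproduction of any published study: it assumes t = 10 equal means, n = 5 observations per mean, and a known error variance (so that each unprotected comparison is a normal-deviate test at exactly the 5% level), and it simply counts wrong rejections among the 45 comparisons. The function name, seed, and number of trials are arbitrary choices.

```python
import random
from itertools import combinations
from statistics import NormalDist


def false_rejection_counts(t=10, n=5, alpha=0.05, trials=4000, seed=12345):
    """Number of falsely significant pairs in each simulated null experiment."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se_diff = (2.0 / n) ** 0.5  # sigma = 1 is treated as known (a simplification)
    counts = []
    for _ in range(trials):
        # All t population means are equal (zero), so every rejection is a Type I error.
        means = [sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n for _ in range(t)]
        counts.append(sum(abs(a - b) > z_crit * se_diff
                          for a, b in combinations(means, 2)))
    return counts


counts = false_rejection_counts()
trials = len(counts)
mean_wrong = sum(counts) / trials
p_one_or_more = sum(c >= 1 for c in counts) / trials
p_two_or_more = sum(c >= 2 for c in counts) / trials
print(f"average wrong rejections per experiment: {mean_wrong:.2f}")
print(f"P(one or more wrong): {p_one_or_more:.2f}")
print(f"P(two or more wrong): {p_two_or_more:.2f}")
```

The average number of wrong rejections should be close to 45 x .05 = 2.25 whatever the dependence among the comparisons, whereas the two tail probabilities depend on that dependence and have to be estimated by sampling.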


In agricultural experiments, the treatment means are much more likely to be unequal so that Type II 
error consideration should be at least as important as Type I error consideration. In the Boardman and 
Moffitt study, the procedures were applied without a prior significant overall F test, which is, in fact, a 
prerequisite of the Fisher’s protected LSD method. Although not required for the Duncan procedure, it may 
be desirable to apply the procedure only after a significant F test. As Dunnett (1970) points out, multiple 
comparison procedures are techniques for ferreting out differences among the t means, and there is no reason 
for doing so, unless there is an indication that differences exist, either a priori or as evidenced by a significant 
F test. The experimentwise error rates for the protected Fisher’s LSD and the “protected” Duncan’s MRT 
will, of course, be 5%. See Bernhardson (1975). 

Based on the study by Boardman and Moffitt (who considered only the null case of equal means), Gill 
recommended Tukey’s HSD and, to a lesser extent, the Newman-Keuls’ procedure. In another simulation study, 
Carmer and Swanson (1973) recommended just the opposite. Their conclusions were: 

. . . that Scheffé’s test, Tukey’s test, and the Student-Newman-Keuls’ test are less appropriate than either the least 
significant difference with the restriction that the analysis of variance F value be significant at α = .05, two Bayesian 
modifications of the least significant difference or Duncan’s multiple range test. Because of its ease of application, many 
researchers may prefer the restricted least significant difference. 

Carmer and Swanson conducted 88,000 simulations in all, with various numbers of treatments and 
replicates, and different patterns of heterogeneity among the treatment means. The study “was prompted 
mainly by the authors’ own uncertainty as to the most appropriate procedure to recommend to students and 
researchers in the agricultural sciences.” In an earlier publication, Carmer and Swanson (1971) reported on 5 
of the present 10 procedures. 

The following multiple comparison procedures were studied: 

1. LSD (unprotected) 

2. TSD (Tukey’s HSD) 

3. SNK (Student-Newman-Keuls) 

4. MRT (Duncan’s multiple range test) 

5. SSD (Scheffé’s procedure) 

6. FSD1 (Fisher’s protected LSD, with the preliminary F test applied at the 1% level) 

7. FSD2 (as in FSD1 but F test at 5% level) 

8. FSD3 (as in FSD1 but F test at 10% level) 

9. BSD (Duncan’s approximate Bayesian k-ratio LSD rule for t = 15 treatments and error d.f. ν ≥ 30; 
see Equation (3.6e) of present report) 

10. BET (Waller-Duncan’s exact Bayesian k-ratio LSD rule) 

We quote from Section 7 (“Concluding Remarks”) of Carmer and Swanson (1973): 

. . . the SSD should never be employed for pairwise multiple comparisons. . . the TSD and SNK are clearly inferior in 
ability to detect real differences. Although the SSD, TSD, and SNK provide excellent protection against Type I errors, it is the 
authors’ feeling that, in evaluation of the various procedures, concern for ability to detect real differences should receive a high 
priority. . . the FSD1 procedure also appears to stress protection against Type I errors at the expense of sensitivity. . . it 
also seems reasonable not to recommend procedures which unduly deemphasize protection against Type I errors. From this 
point of view, then, the ordinary LSD and perhaps the FSD3 can be eliminated from consideration; in addition, their 
sensitivities to real differences are not appreciably greater than those of the FSD2, BSD, BET, and MRT. These latter four 
procedures thus constitute a group from which the consulting statistician or experimenter might generally make a choice. . . 
while the MRT often produces a lower frequency of Type I errors, the other three are generally more sensitive in detecting real 
differences . . . dependence of the critical value on the observed analysis of variance F value is more appealing than 
dependence on the number of treatments in the experiment. Since the BET is an improved and more exact version than the 
BSD, it seems reasonable to prefer the former. . . the procedure (BET) is easier to apply than the MRT. . . many subject 
matter researchers will find the FSD2 attractive because of its simplicity and the fact that they are already familiar with 
Student’s t table. 

Carmer and Swanson’s final choice is thus between FSD2 and BET. Waller and Duncan (1969) claim that 
the similarity in performance between the FSD2 and BET says a lot for BET, but as Carmer and Swanson 
point out, it is just as reasonable to claim that this similarity speaks a lot for the FSD2. 
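
For readers unfamiliar with the FSD2, the two-stage logic can be sketched as follows for a one-way layout with equal replication. This is an illustration only: the function name and the data are hypothetical, and the two critical values (the 5% point of F with 2 and 9 d.f., and the two-sided 5% point of Student’s t with 9 d.f.) are supplied by the user from standard tables rather than computed.

```python
from itertools import combinations


def protected_lsd(groups, f_crit, t_crit):
    """Pairs declared different by the LSD, applied only after a significant F test.

    groups maps a treatment label to its observations (equal replication assumed);
    f_crit and t_crit are tabled critical values for the appropriate d.f.
    """
    names = list(groups)
    t = len(names)
    n = len(groups[names[0]])
    if any(len(groups[g]) != n for g in names):
        raise ValueError("this sketch assumes equal replication")
    means = {g: sum(groups[g]) / n for g in names}
    grand_mean = sum(means.values()) / t
    ms_treatments = n * sum((m - grand_mean) ** 2 for m in means.values()) / (t - 1)
    ms_error = sum((x - means[g]) ** 2
                   for g in names for x in groups[g]) / (t * (n - 1))
    if ms_treatments / ms_error <= f_crit:
        return []  # F not significant: no pairwise comparisons are made
    lsd = t_crit * (2.0 * ms_error / n) ** 0.5
    return [(a, b) for a, b in combinations(names, 2)
            if abs(means[a] - means[b]) > lsd]


# Hypothetical data: 3 treatments, 4 replicates each (error d.f. = 9).
data = {"A": [10, 12, 11, 13], "B": [14, 15, 16, 15], "C": [11, 12, 10, 11]}
significant = protected_lsd(data, f_crit=4.26, t_crit=2.262)
print(significant)
```

With these data the F ratio is 19, so the second stage is reached and the ordinary LSD is then applied to each pair.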

Thomas (1974) compared “seven methods of pairwise comparisons and four for constructing simultane- 
ous sets of confidence limits. The general conclusions are that Duncan’s multiple range test is the best method 
of those considered for the former and the Bonferroni t-based limits for the latter.” 

We mentioned at the beginning of this chapter that one main difficulty in comparing the procedures is due 
to the different kinds of Type I error rates used. Comparing one procedure using a 5% comparisonwise Type I 
error rate with another procedure using a 5% experimentwise Type I error rate is almost like comparing 
oranges with bananas. As Einot and Gabriel (1975) pointed out, any observed difference in the performance of 
the two procedures is more likely to be due to the different Type I error probabilities than to the techniques 
used. Therefore, one should force all procedures to have the same experimentwise (or comparisonwise) Type 
I error rate and compare their powers, as in the Neyman-Pearson two-decision situations. With orthogonal 
contrasts and large numbers of degrees of freedom for error mean square, we have seen in Section 3.1 that for 
t = 10 treatments, say, a 5% experimentwise error rate corresponds to a .57% comparisonwise error rate, and 
a 5% comparisonwise error rate is equivalent to a 36.98% experimentwise error rate. 
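
The two figures just quoted can be checked numerically. The sketch assumes that the relation intended is the one for m = t − 1 independent contrasts, experimentwise rate = 1 − (1 − comparisonwise rate)^m; the function names are illustrative.

```python
def experimentwise_rate(alpha_c, m):
    """Experimentwise error rate implied by a comparisonwise rate over m independent contrasts."""
    return 1.0 - (1.0 - alpha_c) ** m


def comparisonwise_rate(alpha_e, m):
    """Comparisonwise rate needed for a given experimentwise rate over m independent contrasts."""
    return 1.0 - (1.0 - alpha_e) ** (1.0 / m)


t = 10
m = t - 1  # number of orthogonal contrasts among t treatment means
exp_rate = experimentwise_rate(0.05, m)
comp_rate = comparisonwise_rate(0.05, m)
print(f"5% comparisonwise -> {100 * exp_rate:.2f}% experimentwise")
print(f"5% experimentwise -> {100 * comp_rate:.2f}% comparisonwise")
```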

Einot and Gabriel (1975) studied the powers of multiple comparison procedures for fixed maximal 
experimentwise levels, and “. . . generally recommend the Tukey technique for its elegant simplicity and 
existent confidence bounds—its power is little below that of any other method. Simulation was for 3, 4, and 5 
treatments: the conclusions might need modification for more treatments.” 

No doubt the reader will think that the last word has not been written on the choice of a multiple 
comparison procedure. (Some statisticians do not even believe in multiple comparisons. In his discussion of 
the review paper by O’Neill and Wetherill (1971), R. L. Plackett expressed his “view that much of the subject 
of multiple comparisons is essentially artificial,” while J. A. Nelder went so far as stating that in his opinion 
“multiple comparison methods have no place at all in the interpretation of data.”) In the final analysis, the 
choice will be subjective. To a very large extent, this choice will hinge on a choice between an experimentwise 
error rate (for which Tukey’s HSD is the recommended procedure) and a comparisonwise error rate (for 
which Duncan’s MRT is recommended). As mentioned earlier, the author’s opinion is that in the majority of 
cases, the comparisonwise basis is more appropriate since one wrong inference usually does not make the 
other inferences in the same experiment meaningless. There is really not that much difference between the 
methods. We can remove or reduce objections to Duncan’s MRT by requiring an initial significant overall F 
test or by taking Duncan’s comparisonwise α to be 0.01 or 0.001. Similarly, we can remove or reduce 
objections to Tukey’s HSD by taking Tukey’s experimentwise α to be 0.10 or 0.25, but, as Einot and Gabriel 
wondered, it may be that “it does not seem scientifically respectable to work explicitly with a level of 0.25.” 

The choice of the kind of Type I error rates is bypassed altogether in the Waller-Duncan Bayesian k-ratio 
LSD rule. It also has the extremely appealing feature that the observed F value is used in the calculation of 
the LSD. With a large F (of 3.0 and above, indicating strong evidence of existence of differences), the test 
behaves like the comparisonwise procedures (Duncan’s MRT and Fisher’s LSD) with good power properties, 
while for a small F, it becomes conservative with good protection against Type I error, as in the Tukey HSD 
procedure. It is as if the choice between a comparisonwise and an experimentwise error rate is taken out of 
the experimenter’s hands and is determined by the experiment itself (the experimental F value). “In this way 
the decision theoretic rule enjoys the advantages of both comparisonwise and experimentwise α rules without 
their disadvantages.” (Dixon and Duncan 1975, p. 822). This procedure will become more popular in the 
future, especially if more extensive tables become available. 


TABLE B.—Percentage points of the studentized range q(α;p,ν)* 


TABLE B.—Percentage points of the studentized range q(α;p,ν)*—Continued 

α = .05 

  ν     p = 11      12      13      14      17      18      19 
  1      50.59   51.96   53.20   54.33   57.22   58.04   58.83 
  2      14.39   14.75   15.08   15.38   16.14   16.37   16.57 
  3      9.717   9.946   10.15   10.35   10.84   10.98   11.11 
  4      8.027   8.208   8.373   8.525   8.914   9.028   9.134 
  5      7.168   7.324   7.466   7.596   7.932   8.030   8.122 
  6      6.649   6.789   6.917   7.034   7.338   7.426   7.508 
  7      6.302   6.431   6.550   6.658   6.939   7.020   7.097 
  8      6.054   6.175   6.287   6.389   6.653   6.729   6.802 
  9      5.867   5.983   6.089   6.186   6.437   6.510   6.579 
 10      5.722   5.833   5.935   6.028   6.269   6.339   6.405 

 11      5.605   5.713   5.811   5.901   6.134   6.202   6.265 
 12      5.511   5.615   5.710   5.798   6.023   6.089   6.151 
 13      5.431   5.533   5.625   5.711   5.931   5.995   6.055 
 14      5.364   5.463   5.554   5.637   5.852   5.915   5.974 
 15      5.306   5.404   5.493   5.574   5.785   5.846   5.904 
 16      5.256   5.352   5.439   5.520   5.727   5.786   5.843 
 17      5.212   5.307   5.392   5.471   5.675   5.734   5.790 
 18      5.174   5.267   5.352   5.429   5.630   5.688   5.743 
 19      5.140   5.231   5.315   5.391   5.589   5.647   5.701 
 20      5.108   5.199   5.282   5.357   5.553   5.610   5.663 

 24      5.012   5.099   5.179   5.251   5.439   5.494   5.545 
 30      4.917   5.001   5.077   5.147   5.327   5.379   5.429 
 40      4.824   4.904   4.977   5.044   5.216   5.266   5.313 
 60      4.732   4.808   4.878   4.942   5.107   5.154   5.199 
120      4.641   4.714   4.781   4.842   4.998   5.044   5.086 
  ∞      4.552   4.622   4.685   4.743   4.891   4.934   4.974 


TABLE B.—Percentage points of the studentized range q(α;p,ν)*—Continued 

α = .05 

  ν     p = 20      22      24      26      28      30      32      34      36 
120      5.126   5.200   5.266   5.327   5.382   5.434   5.481   5.526   5.568 
  ∞      5.012   5.081   5.144   5.201   5.253   5.301   5.346   5.388   5.427 


TABLE B.—Percentage points of the studentized range q(α;p,ν)*—Continued 


82.59 


α = .01 

  ν        p = 12          13          14          15          16          17 
  1     [missing]       266.2       271.8       277.0       281.8       286.3 
  2         33.40       34.13       34.81       35.43       36.00       36.53 
  3         17.53       17.89       18.22       18.52       18.81       19.07 
  4         12.84       13.09       13.32       13.53       13.73       13.91 
  5         10.70       10.89       11.08       11.24       11.40       11.55 
  6         9.485       9.653       9.808       9.951       10.08       10.21 
  7         8.711       8.860       8.997       9.124       9.242       9.353 
  8         8.176       8.312       8.436       8.552       8.659       8.760 
  9         7.784       7.910       8.025       8.132       8.232       8.325 
 10         7.485       7.603       7.712       7.812       7.906       7.993 

 11         7.250       7.362       7.465       7.560       7.649       7.732 
 12         7.060       7.167       7.265       7.356       7.441       7.520 
 13         6.903       7.006       7.101       7.188       7.269       7.345 
 14         6.772       6.871       6.962       7.047       7.126 [illegible] 
 15         6.660       6.757       6.845       6.927       7.003       7.074 
 16         6.564       6.658       6.744       6.823       6.898       6.967 
 17         6.480       6.572       6.656       6.734       6.806       6.873 
 18         6.407       6.497       6.579       6.655       6.725       6.792 
 19         6.342       6.430       6.510       6.585       6.654       6.719 
 20         6.285       6.371       6.450       6.523       6.591       6.654 

 24         6.106       6.186       6.261       6.330       6.394       6.453 
 30         5.932       6.008       6.078       6.143       6.203       6.259 
 40         5.764       5.835       5.900       5.961       6.017       6.069 
 60         5.601       5.667       5.728       5.785       5.837       5.886 
120         5.443       5.505       5.562       5.614       5.662       5.708 
  ∞         5.290       5.348       5.400       5.448       5.493       5.535 


5.074 


α = .01 

  ν     p = 19 
  1      294.3 
  2      37.50 
  3      19.55 
  4      14.24 
  5      11.81 
  6      10.43 
  7      9.554 
  8      8.943 
  9      8.495 
 10      8.153 

 11      7.883 
 12      7.665 
 13      7.485 
 14      7.333 
 15      7.204 
 16      7.093 
 17      6.997 
 18      6.912 
 19      6.837 
 20      6.771 

 24      6.563 
 30      6.361 
 40      6.165 
 60      5.974 
120      5.790 
  ∞      5.611 


TABLE B.—Percentage points of the studentized range q(α;p,ν)*—Continued 

α = .01 

  ν        p = 34          36 
  1     [missing]       338.0 
  2     [missing]       42.78 
  3         21.95       22.17 
  4         15.92       16.08 
  5         13.15       13.28 
  6         11.58       11.69 
  7         10.58       10.67 
  8         9.874       9.964 
  9         9.360       9.443 
 10         8.966       9.044 

 11         8.654       8.728 
 12         8.402       8.473 
 13         8.193       8.262 
 14         8.018       8.084 
 15         7.869       7.932 
 16         7.739       7.802 
 17         7.627       7.687 
 18         7.528       7.587 
 19         7.440       7.498 
 20         7.362       7.419 

 24         7.119       7.173 
 30         6.881       6.932 
 40         6.650       6.697 
 60         6.424       6.467 
120         6.204       6.244 
  ∞         5.990       6.026 


TABLE B.—Percentage points of the studentized range q(α;p,ν)*—Continued 

α = .01 

  ν     p = 38      40      50      60      70      80      90     100 
  1      341.5   344.8   358.9   370.1   379.4   387.3   394.1   400.1 
  2      43.21   43.61   45.33   46.70   47.83   48.80   49.64   50.38 
  3      22.39   22.59   23.45   24.13   24.71   25.19   25.62   25.99 
  4      16.23   16.37   16.98   17.46   17.86   18.20   18.50   18.77 
  5      13.40   13.52   14.00   14.39   14.72   14.99   15.23   15.45 
  6      11.80   11.90   12.31   12.65   12.92   13.16   13.37   13.55 
  7      10.77   10.85   11.23   11.52   11.77   11.99   12.17   12.34 
  8      10.05   10.13   10.47   10.75   10.97   11.17   11.34   11.49 
  9      9.521   9.594   9.912   10.17   10.38   10.57   10.73   10.87 
 10      9.117   9.187   9.486   9.726   9.927   10.10   10.25   10.39 
 11      8.798   8.864   9.148   9.377   9.568   9.732   9.875   10.00 
 12      8.539   8.603   8.875   9.094   9.277   9.434   9.571   9.693 
 13      8.326   8.387   8.648   8.859   9.035   9.187   9.318   9.436 
 14      8.146   8.204   8.457   8.661   8.832   8.978   9.106   9.219 
 15      7.992   8.049   8.295   8.492   8.658   8.800   8.924   9.035 
 16      7.860   7.916   8.154   8.347   8.507   8.646   8.767   8.874 
 17      7.745   7.799   8.031   8.219   8.377   8.511   8.630   8.735 
 18      7.643   7.696   7.924   8.107   8.261   8.393   8.508   8.611 
 19      7.553   7.605   7.828   8.008   8.159   8.288   8.401   8.502 
 20      7.473   7.523   7.742   7.919   8.067   8.194   8.305   8.404 
 24      7.223   7.270   7.476   7.642   7.780   7.900   8.004   8.097 
 30      6.978   7.023   7.215   7.370   7.500   7.611   7.709   7.796 
 40      6.740   6.782   6.960   7.104   7.225   7.328   7.419   7.500 
 60      6.507   6.546   6.710   6.843   6.954   7.050   7.133   7.207 
120      6.281   6.316   6.467   6.588   6.689   6.776   6.852   6.919 
  ∞      6.060   6.092   6.228   6.338   6.429   6.507   6.575   6.636 


Source: Reproduced from H. Leon Harter, Order Statistics and Their Use in Testing and Estimation, 
vol. 1 (1970), pp. 623-661, U.S. Government Printing Office, Washington, D.C., with the permission of 
the author. 


TABLE D1.—Critical values of k-ratio t test (k = 100) 
ν (denominator d.f. for F) 


q(num. d.f. for F) 6 8 
8 2.91 2.94 
10 2.93 2.98 
12 2.95 3.01 
14 2.96 3.03 
16 2.97 3.05 
20 Z.99 3.08 
40 3.02 3.13 
100 3.04 3.17 
oo 3.05 3.20 

24 eS 
6 2.85 2.84 
8 2.88 2.89 
10 2.90 2.93 
12 2.92 2.95 
14 2.93 2.97 
16 2.94 8) 
20 2.95 3.01 
40 2.98 3.06 
100 2.99 3.09 
00 3.01 3.12 

2 * * 

4 * * 
6 2.82 2.19 
8 2.84 2.83 
10 2.86 2.86 
12 2.87 2.88 
14 2.88 2.90 
16 2.89 2.91 
20 2.90 2.93 
40 2.93 2.97 
100 2.94 2.99 
0 2.95 3.01 

2 * * 
4 2.74 2.67 
6 2.19 2.74 
8 2.81 2.17 
10 2.83 2.80 
12 2.84 2.02 
14 2.85 2.83 
16 2.85 2.84 
20 2.86 2.85 
40 2.88 2.89 
100 2.89 2.91 
% 2.90 2.92 


See footnotes at end of table. 


10 


12 


14 16 


* 


2.98 2.99 
3.05 3.06 
3.10 3.12 
3.14 3.16 
3.18 3.20 
3.23 3.26 
3.35 3.39 
3.44 3.50 
3.50 3.58 


18 


20 


F = 1.2 (a = .913, b = 2.449) 
* * * 


2.99 
3.08 
3.14 
3.19 
3.24 
3.30 
3.47 
3.59 
3.70 


F = 1.4 (a = .845, b = 1.871) 


* * 
2.82 2.81 
2.90 2.89 
2.95 2.96 
3.00 3.00 
3.03 3.04 
3.06 3.07 
3.10 3.11 
3.19 3.22 
3.26 3.29 
3.31 3.35 


* 


2.80 
2.89 
2.96 
3.01 
3.04 
3.08 
3.12 
3.24 
3.32 
3.39 


F = 1.7 (a = .767, b = 


* * 

‘ 2.61 
2.72 2.71 
2.78 2.77 
2.83 2.82 
2.86 2.85 
2.89 2.88 
2.90 2.90 
2.93 2.93 
3.00 3.00 
3.05 3.05 
3.08 3.09 


* 


F = 2.0 (a = .707, b = 1. 


* * 
2.56 2.54 
2.64 2.62 
2.69 2.67 
Come 0 
EE 
Se OS 
218) eee 
2.80 2.78 
2.85. 2.83 
2.88 2.86 
2.90 2.88 


* 


2.52 
2.60 
2.65 
2.69 
2.71 
2.73 
2.74 
2.77 
2.81 
2.84 
2.86 


24 
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TABLE D1.—Critical values of k-ratio t test (k = 100)—Continued 


v(denominator d.f. for F) 


q(num. d.f. for F) 6 8 10 12 14 16 18 20 24 30 40 60 120 
F = 2.4 (a = .645, b = 1.309) 
Pe iB * * * * * & * * * * * 2. 18 
4 2.71 2.63 Patsy Oo 2.49 2.47 2.44 2.43 2.40 mot 2.34 PepoM 2.28 
6 PAS 2.68 2.63 2.58 2.55 2.52 2.50 2.48 2.46 2.42 2.39 2.36 2.32 
8 PACH 2.71 2.66 2.62 2.59 2.56 2.54 Pape) 2.49 2.40 2.42 2.38 2.34 
10 2.79 2.73 2.68 2.64 2.61 2.58 2.56 2.54 2.50 2.47 2.43 2.39 2.34 
12 2.79 2.74 2.70 2.66 2.62 2.60 2.57 2,00 2.62 2.48 2.44 2.09 2.35 
14 2.80 Pang (5) 2.71 2.67 2.64 2.61 2.58 2.56 22De 2.49 2.44 2.40 2.35 
16 2.81 2.76 PAN 4 2.68 2.65 2.62 2.59 2.57 2.538 2.49 2.45 2.40 2.34 
20 2.82 PAH ale 2.69 2.66 2.63 2.60 2.58 2.54. 2.50 2.45 2.40 2.34 
40 2.83 2.80 2.76 uke 2.69 2.66 2.63 2.60 2.56 Pa oyl 2.46 2.39 Aes 
100 2.84 2.81 2.78 2.74 ail 2.67 2.64 2.62 DEOL 25) 2.45 2.39 (eae 
© 2.85 2.83 2.79 2.76 2.72 2.68 2.65 2.62 PASH 2.51 2.45 2.38 al 
F = 3.0 (a = .577, b = 1.225) 
2 * 2.41 2.36 Peeves 2.29 Peat pAvass yA yd 2.20 Zap Uff 2.14 all 
4 2.68 PAL 2.50 2.45 2.41 2.38 2.35 2.30 2.30 7d AE 2.24 2.20 Pali 
6 vat fl 2.61 2.54 2.49 2.44 2.41 2.39 2.36 2.33 2.29 2.26 De 2.18 
8 Dale 2.63 2.56 ZED 2.47 2.43 2.40 2.38 2.34 PACD Beak Boze 2.18 
10 2.74 2.65 2.58 2.52 2.48 2.44 2.41 2.39 2.35 2.31 eed 7h Peed 2.18 
12 2.74 2.66 2.59 2.53 2.49 2.45 2.42 2.40 2.36 2.01 Zak ule 2.18 
14 PASTS 2.66 2.60 2.54 2.49 2.46 2.43 2.40 2.36 Ph ae rapa Dee Zl 
16 2.45 2.67 2.60 PAgs5) 2.50 2.46 2.43 2.40 2.36 PRS Y Payal eee 2.17 
20 2.76 2.68 2.61 2.59 2.51 2.47 2.43 2.41 2.36 2.32 PAPAL Dees Zoue 
40 PAH 2.70 2.63 2.57 2.52 2.48 2.44 2.41 2.30 Zoe 2.26 AVA 26 
100 2.78 Pa 2.64 2.58 2203 2.49 2.45 2.42 Z.at iol! 2.26 ee 216 
oo 2.79 atl 2.65 2.59 2.53 2.49 2.45 2.42 PAG Ziol 2.26 2.20 Daley 
F = 4.0 (a = .500, b = 1.155) 
2 2.58 2.44 2.35 2.29 P47) Deed 2.20 2.18 raps ls YAAWA 2.09 2.06 2.03 
4 2.63 2.50 2.41 2.00 2.30 AGH 2.24 Dene 2.18 a Vasile? 2.08 2.05 
6 2.65 2.52 2.43 2:08 2.32 2.28 2.25 2.23 2.19 2.16 elkZ, 2.08 2.04 
10 2.67 2.55 2.46 2.39 2.34 2.30 2.26 2.24 2.20 2.16 ele 2.08 2.04 
20 2.69 2.57 2.47 2.40 2.35 2.30 PAH | 2.24 2.20 215 VAL 2.07 2.03 
© PALE 2.59 2.49 2.42 2.36 2.31 VA | 2.24 2.19 2.15 PAB 2.06 2.02 
F = 6.0 (a= .408, b = 1.095) 
2 2.53 2.37 PAPAL aA 2.16 2.138 2.10 2.08 2.05 2.02 1.99 1.96 1.93 
4 2.56 2.40 2.30 Bice 2.18 One 2.12 2.09 2.06 2.02 1.99 1.96 1.93 
6 2.58 2.42 2.31 2.24 2.19 2.15 ake 2.09 2.06 2.02 1.99 1.96 1.92 
10 2.59 2.43 2.32 2.24 2.19 2.15 ule 2.09 2.06 2.02 1.99 1.95 1.92 
20 2.60 2.44 2.02 2.25 2.19 2.15 PAS WA 2.09 2.05 2.02 1.98 1.95 1.92 
ro 2.61 2.44 2.33 2.25 2.19 2.15 Pa WA 2.09 2.05 2.02 1.98 1.95 1.92 
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See footnotes at end of table. 


TABLE D1.—Critical values of k-ratio t test (k = 100)—Continued 
ν (denominator d.f. for F) 


q (num. d.f. for F)     6     8    10    12    14    16    18    20    24    30    40    60   120 


F = 10.0 (a = .316, b = 1.054) 

   2   2.48  2.30  [illegible]  2.12  2.07  2.04  2.01  1.99  1.96  1.93  1.90  1.87  1.85 
   4   2.49  2.31  2.20  2.13  2.08  2.04  2.01  1.99  1.96  1.93  1.90  1.87  1.84 
   6   2.50  2.31  2.20  2.13  2.08  2.04  2.01  1.99  1.96  1.93  1.90  1.87  1.84 
10–∞   2.51  2.32  2.20  2.13  2.08  2.04  2.01  1.99  1.96  1.93  1.90  1.87  1.84 

F = 25.0 (a = .200, b = 1.021) 

 2–4   2.40  2.20  2.10  2.03  1.99  1.95  1.93  1.91  1.88  1.86  1.83  1.80  1.78 
 6–∞   2.41  2.21  2.10  2.03  1.99  1.95  1.93  1.91  1.88  1.86  1.83  1.80  1.78 

F = ∞ (a = 0, b = 1) 

 2–∞   2.33  2.13  2.03  [illegible]  1.93  1.90  1.88  1.86  1.84  1.81  1.79  1.76  1.74 


*All differences not significant. a = 1/F^(1/2), b = [F/(F − 1)]^(1/2). 
If ν = 4, t = 2.83 for all q and F satisfying F > 8.12/q. 


Source: Reproduced from Waller, Ray A., and Duncan, David B., A Bayes Rule for the Symmetric Multiple Comparisons Problem, 
Corrigenda, Journal of the American Statistical Association, vol. 67 (1972), with permission of author and publisher. 
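
The quantities a and b printed beside each F value in Tables D1 and D2 can be recomputed from F. The sketch below assumes the footnote is to be read as a = 1/√F and b = √(F/(F − 1)), a reading that reproduces the tabulated pairs (for example, F = 4.0 gives a = .500 and b = 1.155); the function name is illustrative.

```python
import math


def k_ratio_constants(f_ratio):
    """Return (a, b) for an observed F ratio, with a = 1/sqrt(F) and b = sqrt(F/(F - 1))."""
    if f_ratio <= 1.0:
        raise ValueError("b is defined only for F > 1")
    a = 1.0 / math.sqrt(f_ratio)
    b = math.sqrt(f_ratio / (f_ratio - 1.0))
    return a, b


for f_value in (1.2, 1.4, 1.7, 2.0, 2.4, 3.0, 4.0, 6.0, 10.0, 25.0):
    a, b = k_ratio_constants(f_value)
    print(f"F = {f_value:5.1f}   a = {a:.3f}   b = {b:.3f}")
```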
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TABLE D2.—Critical values of k-ratio t test (k =500) 


v (denominator df. for F) 


q (num. d.f. for F) 6 8 10 12 14 16 18 20 24 30 40 60 120 
F= 1.2 (a =.913, b = 2.449) 

2-16 * * * * * * * * * * * * * 
20 4.70 4.82 4.89 : : : : : : : : : ; 
40 9.4.95 | (491 45.03 995.12 9520 95.25 580 Gbe4 ovbdl 05.48 90555 eb 6mm DeT 

100 «4.79 «6-498 6 518525 BA 548 O06 GS BG B89) 6.02 (6.18 
02 4.81 503 5.20 5.34 546 5.56 565 5.73 5.86 602 620 641 6.56 

F = 1.4 (a= .845, b = 1.871) 

2-14 * * * * * * * * * * * * * 
16 461 4.66 468 469 469 469 469 468 467 4.65 462 .4.58 4.53 
20. 64 A048 4b AG ATE 6 AG 4A ee GS ea be 
40 468 4.78 485 489 492 494 496 496 497 497 495 490 4.81 
00 ATA = 4.88* 4.99" 15,06 = 512" 5,17 1520 o.28- 6.26” 5.28 eGo 16nd Be 

F = 1.7 (a = .767, b = 1.558) 

2-8 * * * * * * * * * * * * * 
10 : : : : ; ; : : a 4.08" 402 eso, SET 
12 4.50 4.46 442 498 484 4.30 4.27 424 419 414 407 3.99 3.90 
20 455 454 452 449 446 443 440 437 482 426 418 4.08 3.95 
40 459 461 461 460 457 455 452 449 444 436 426 412 3.98 
o 464 469 4.71 4.72 4.71 469 466 463 457 446 431 4.07 3.76 

F = 2.0 (a = .707, b = 1.414) 

2-6 * * * * * %* * * * * * * * 
8 : : : : «0, ho93 wees 68 9 3.88 9.76" 908.69" | 95.60 gee bl 
10 441 481 422 415 408 403 398 3.94 388 380 3.72 8.63 3.53 
20 448 441 434 427 421 416 410 406 398 389 378 365 3.51 
40 451 447 441 435 429 4.23 417 412 4083 392 3.78 3.62 3.44 
00 455 468 449 443 487 431 425 419 407 393 3.75 3.64 3.88 

F = 2.4 (a = .645, b = 1.309) 

phat) * * * * * * * * * * * * * 
6 : ; : */ BT p23. Tl 3.65 E868 1 S:545 $.47-98 98.89 7 wa Omen 
8 A.B) 41d 4.01" 8:91 | 3.880 8876. 8708.66. 9.68 «2 8.50 “S41 gee 
10 4.33 418 405 8.96 3.87 3,79 2730 3.68 > 8600 35% 342s eis al 
20 439 4.26 414 404 395 3.87 380 3.74 364 353 341 328 38.15 
00 445 435 425 414 403 394 385 3.78 364 350 334 318 3.04 

F = 3.0 (a = .577, b = 1.225) 
y * * * * * * * * * * * * * 
4 : : : : "348" 93:98" 3.381 7826 93:19, asec Ot ne 7 
6 419 895 3.79 866 356 349 343 337 330 321 818 804 2.95 
10 4.24 “42 3.85. 3.72 3.62) 3.58) 346 9 8.40 "8.81 | 3.21) aise tome 
20 428 6408 3.91 3.77 365 3.5609 348 341 “3.3t 8.20° "98.007 eo Comms 
00 438 415 $3.97 3.82 3.69 “3.57 9.48 3.40 3:28 9.16 9 8/03 2 oom 
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TABLE D2.—Critical values of k-ratio t test (k = 500)—Continued 
v (denominator d.f. for F) 


q(num. d.f. for F) 6 8 10 12 14 16 18 20 24 30 40 60 120 


F = 4.0 (a = .500, b = 1.155) 


2 * * * * * * * * * * * 2.81 2.75 
4 ¥ 3.74 3.54 3.40 3.30 3.22 3.16 3.11 3.04 2.96 2.89 2.81 2.74 
6 4,08 3.78 3.58 3.43 3.32 3.24 3.17 3.12 3.04 2.95 2.87 219 2.71 
10 4,12 3.83 3.62 3.46 3.34 3.25 3.17 3.11 3.03 2.94 2.85 2.717 2.69 
20 4.15 3.86 3.64 3.48 3.35 3.25 3.17 3.10 3.01 2.92 2.83 2.74 2.66 
90 4.19 3.90 3.67 3.49 3.35 3.24 3.15 3.09 2.99 2.89 2.80 2.72 2.65 


F = 6.0 (a = .408, b = 1.095) 


2 . * Bos) "atti" god 70.07 Wico.o1m WBor7 « lost 2744 2.687! 2.62 2.86 
4 Boor “Me54, —-giea) 317106” o-98%o.92 Mae? SeibO> «olvah “2.66'.! 260 © 2.68 
6 S05) (Pacay |)Psaae Goes) 9/06 etre 98° 492.01 Sees .f2579) 272k 26575 2.58. 2.62 
10 GoGe MeB.6O me oNedin ea.18™) | 9.06% 2.978891, Bee fee | 6 tS fed) 267 (fb 
20 $07. MiS.602.. -Gis5e 8.18 99,069 207) 2.90, Moss BeY AS 2687 F ose 2.61 
00 $99 8.62. 8.36 3.18 38.05 296° 2.89 288 276 269° 262 2.56 2.50 


F = 10.0 (a = .316, b = 1.054) 


2 3.72 3.33 3.10 2.96 2.86 2.79 2.74 2.70 2.64 2.58 2.52 2.47 2.42 


4 3.75 3.35 3.11 2.96 2.86 2.79 2.73 2.69 2.63 2.57 2.51 2.46 2.41 
10 3.78 3.36 3.11 2.96 2.85 2.78 2.72 2.68 2.62 2.56 2.50 2.45 2.40 
20 3.79 3.36 3.11 2.96 2.85 2.78 2.72 2.68 2.62 2.56 2.50 2.45 2.40 
2 3.80 3.37 3.11 2.95 2.85 2.17 2.72 2.67 2.61 2.56 2.50 2.45 2.40 


F = 25.0 (a = .200, b = 1.021) 


2 3.55 3.14 2.92 2.19 2.70 2.64 2.59 2.56 2.51 2.46 2.41 2.36 2.32 
10 3.57 3.14 2.92 2.79 2.70 2.64 2.59 2.55 2.50 2.45 2.41 2.36 2.32 
00 3.57 3.14 2.92 2.78 2.70 2.63 2.59 2.55 2.50 2.45 2.41 2.36 2.32 


F = ∞ (a = 0, b = 1) 


2–∞ 3.39 3.00 2.80 2.69 2.61 2.55 2.51 2.48 2.44 2.39 2.35 2.31 2.20 


*All differences not significant. a = 1/F^(1/2), b = [F/(F − 1)]^(1/2). 
If ν = 4, t = 4.52 for all q and F satisfying F > 20.43/q. 
Source: Reproduced from Waller, Ray A., and Duncan, David B., A Bayes Rule for the Symmetric Multiple Comparisons Problem, 
Corrigenda, Journal of the American Statistical Association, vol. 67 (1972), pp. 253-255, with the permission of the author and the 
publisher. 
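
The constants a and b printed beside each F value in Table D2 follow from the footnote formulas a = 1/F^(1/2) and b = [F/(F - 1)]^(1/2). A minimal sketch that recomputes them for the tabulated F values is given below; the function name `k_ratio_a_b` is hypothetical and is used only for illustration.

```python
import math

def k_ratio_a_b(f_ratio):
    """Return (a, b) with a = 1/sqrt(F) and b = sqrt(F/(F - 1)).

    f_ratio is the observed F ratio; it must exceed 1 and may be math.inf,
    in which case the limiting values a = 0 and b = 1 are returned.
    """
    if f_ratio == math.inf:
        return 0.0, 1.0
    if f_ratio <= 1.0:
        raise ValueError("F must be greater than 1 for b to be defined")
    a = 1.0 / math.sqrt(f_ratio)
    b = math.sqrt(f_ratio / (f_ratio - 1.0))
    return a, b

# Recompute the constants shown in the block headings of Table D2.
for f_ratio in (4.0, 6.0, 10.0, 25.0, math.inf):
    a, b = k_ratio_a_b(f_ratio)
    print(f"F = {f_ratio}: a = {a:.3f}, b = {b:.3f}")
```

Rounded to three decimals, the results agree with the headings F = 4.0 (a = .500, b = 1.155) through F = 25.0 (a = .200, b = 1.021).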
TABLE E.—100γ% points of the distribution of the largest absolute value of k uncorrelated Student t variates 
with v degrees of freedom 

[Only the columns shown survive in this copy; columns for k = 15 and 20 are missing. Entries marked [?] are illegible.]

γ = 0.90

  v     k = 5   k = 6  k = 12
  3     3.844   4.011   4.631
  4     3.368   3.506   4.020
  5     3.116   3.239   3.694
  6     2.961   3.074   3.493
  7     2.856   2.962   3.355
  8     2.780   2.881   3.255
  9     2.723   2.819   3.179
 10     2.678   [?]     3.120
 11     2.642   2.733   3.072
 12     2.612   2.701   3.032
 15     2.548   2.633   2.947
 20     2.486   2.567   2.863
 25     2.450   2.528   2.814
 30     2.426   2.502   2.781
 40     2.397   2.470   2.741
 60     2.368   2.439   2.701

γ = 0.95

  v     k = 5   k = 6  k = 12
  3     5.023   5.233   6.015
  4     4.203   4.366   4.975
  5     3.789   3.928   4.447
  6     3.541   3.664   4.129
  7     3.376   3.489   3.916
  8     3.258   3.365   3.764
  9     3.171   3.272   3.651
 10     3.103   3.199   3.562
 11     3.048   3.142   3.491
 12     3.004   3.095   3.433
 15     2.910   2.994   3.309
 20     2.819   2.898   3.190
 25     2.766   2.842   3.121
 30     2.732   2.805   3.075
 40     2.690   2.760   3.019
 60     2.649   2.716   2.964

γ = 0.99

  v     k = 1   k = 2   k = 3   k = 4   k = 5   k = 6   k = 8  k = 10  k = 12
  3     5.841   7.127   7.914   8.479   8.919   [?]     [?]     [?]    10.616
  4     4.604   5.462   5.985   6.362   6.656   6.897   [?]     [?]     7.801
  5     4.032   4.700   5.106   5.398   5.625   5.812   [?]     [?]     6.519
  6     3.707   4.271   4.611   4.855   5.046   5.202   [?]     [?]     5.796
  7     3.500   3.998   4.296   4.510   4.677   4.814   [?]     [?]     5.335
  8     3.350   3.809   4.080   4.273   4.424   4.547   [?]     [?]     5.017
  9     3.250   3.672   3.922   4.100   4.239   4.353   [?]     [?]     4.785
 10     3.169   3.567   3.801   3.969   4.098   4.205   [?]     [?]     4.609
 11     3.106   3.485   3.707   3.865   3.988   4.087   [?]     [?]     4.470
 12     3.055   3.418   3.631   3.782   3.899   3.995   [?]     [?]     4.359
 15     2.947   3.279   3.472   3.608   3.714   3.800   [?]     [?]     4.125
 20     2.845   3.149   3.323   3.446   3.541   3.617   [?]     [?]     3.907
 25     2.788   3.075   3.239   3.354   3.442   3.514   [?]     [?]     3.783
 30     2.750   3.027   3.185   3.295   3.379   3.448   [?]     [?]     3.704
 40     2.705   2.969   3.119   3.223   3.303   3.367   [?]     [?]     3.607
 60     2.660   2.913   3.055   3.154   3.229   3.290   3.384   3.456   3.515

Source: Reproduced from Hahn and Hendrickson (1971), Biometrika 58, p. 323, with the permission of the author and publisher. 
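
Percentage points of the kind tabulated in Table E can be approximated numerically. The sketch below is not the method used by Hahn and Hendrickson. It assumes that "uncorrelated Student t variates" means k independent standard normal numerators divided by one common, independent estimate s with v degrees of freedom, so that P(max |t| <= c) is the integral over s of [2Φ(cs) - 1]^k times the density of s. The function names, integration limit, and step count are illustrative choices.

```python
import math

def prob_max_abs_t_at_most(c, k, v, upper=10.0, steps=4000):
    """P(max_i |T_i| <= c) under the common-denominator assumption.

    Simpson's rule over s in [0, upper], where v*s^2 is chi-square with
    v degrees of freedom; steps must be even.
    """
    log_norm = math.log(2.0) + (v / 2.0) * math.log(v / 2.0) - math.lgamma(v / 2.0)

    def integrand(s):
        if s <= 0.0:
            return 0.0  # erf(0) = 0, so the integrand vanishes at s = 0
        log_density = log_norm + (v - 1) * math.log(s) - v * s * s / 2.0
        # 2*Phi(x) - 1 equals erf(x / sqrt(2)) for the standard normal.
        return math.erf(c * s / math.sqrt(2.0)) ** k * math.exp(log_density)

    h = upper / steps
    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3.0

def max_abs_t_point(gamma, k, v):
    """Bisection for the c at which P(max |t| <= c) equals gamma."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if prob_max_abs_t_at_most(mid, k, v) < gamma:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# With k = 1 the point reduces to the ordinary two-sided Student t value.
print(round(max_abs_t_point(0.99, 1, 10), 3))
```

For k = 1 the result is the two-sided Student t point, which gives a direct check against the k = 1 column. For larger k the computed points should be close to the tabulated ones only if the common-denominator assumption matches the table's definition.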
TABLE F2.—Critical values of t(α;q,v) for two-sided Dunnett’s tests for comparing control against each of q 
other treatments 


Source: Reproduced from C. W. Dunnett, New tables for multiple comparisons with a control, Biometrics 20 (1964), with the 
permission of the author and the editor. 
TABLE G.—Critical values of t(α;p,v) for testing zero against nonzero dose levels 
(p = number of nonzero levels) 

α = .05

                                    p
  v       1      2      3      4      5      6      7      8      9     10

  5     2.02   2.14   2.19   2.21   2.22   2.23   2.24   2.24   2.25   2.25
  6     1.94   2.06   2.10   2.12   2.13   2.14   2.14   2.15   2.15   2.15
  9     1.83   1.93   1.96   1.98   1.99   2.00   2.00   2.01   2.01   2.01

 10     1.81   [?]    1.94   1.96   1.97   1.97   1.98   1.98   1.98   1.98
 11     1.80   1.89   1.92   1.94   1.94   1.95   1.95   1.96   1.96   1.96
 12     1.78   1.87   1.90   1.92   1.93   1.93   1.94   1.94   1.94   1.94
 13     1.77   1.86   1.89   1.90   [?]    1.92   1.92   1.93   1.93   1.93
 14     1.76   1.85   1.88   1.89   1.90   [?]    [?]    [?]    1.92   1.92

 15     1.75   1.84   1.87   1.88   1.89   1.90   1.90   1.90   1.90   [?]
 16     1.75   1.83   1.86   1.87   1.88   1.89   1.89   1.89   1.90   1.90
 17     1.74   1.82   1.85   1.87   1.87   1.88   1.88   1.89   1.89   1.89
 18     1.73   1.82   1.85   1.86   1.87   1.87   1.88   1.88   1.88   1.88
 19     1.73   1.81   1.84   1.85   1.86   1.87   1.87   1.87   1.87   1.88

 20     1.72   1.81   1.83   1.85   1.86   1.86   1.86   1.87   1.87   1.87
 22     1.72   1.80   1.83   1.84   1.85   1.85   1.85   1.86   1.86   1.86
 24     1.71   [?]    1.82   1.83   1.84   1.84   1.85   1.85   1.85   1.85
 26     1.71   1.79   1.81   1.82   1.83   1.84   1.84   1.84   1.84   1.85
 28     1.70   1.78   1.81   1.82   1.83   1.83   1.83   1.84   1.84   1.84

 30     1.70   1.78   1.80   1.81   1.82   1.83   1.83   1.83   1.83   1.83
 35     1.69   [?]    [?]    1.80   1.81   1.82   1.82   1.82   1.82   1.83
 40     1.68   1.76   1.79   1.80   1.80   1.81   1.81   1.81   1.82   1.82
 60     1.67   1.75   [?]    [?]    [?]    [?]    1.80   1.80   1.80   1.80
120     1.66   1.73   1.75   [?]    [?]    1.78   1.78   1.78   1.78   1.78
  ∞     1.645  [?]    [?]    [?]    [?]    [?]    [?]    [?]    1.767  1.768


TABLE G.—Critical values of t(α;p,v) for testing zero against nonzero dose levels 
(p = number of nonzero levels)—Continued 

α = .01

                                    p
  v       1      2      3      4      5      6      7      8      9     10

  5     3.36   3.50   3.55   3.57   3.59   3.60   3.60   3.61   3.61   3.61
  6     3.14   3.26   3.29   3.31   3.32   3.33   3.34   3.34   3.34   3.35
  7     3.00   3.10   3.13   3.15   3.16   3.16   3.17   3.17   3.17   3.17
  8     2.90   2.99   3.01   3.03   3.04   3.04   3.05   3.05   3.05   3.05
  9     2.82   2.90   2.93   2.94   2.95   2.95   2.96   2.96   2.96   2.96

 10     2.76   2.84   2.86   2.88   2.88   2.89   2.89   2.89   2.90   2.90
 11     2.72   2.79   2.81   2.82   2.83   2.83   2.84   2.84   2.84   2.84
 12     2.68   2.75   2.77   2.78   2.79   2.79   2.79   2.80   2.80   2.80
 13     2.65   2.72   2.74   2.75   2.75   2.76   2.76   2.76   2.76   2.76
 14     2.62   2.69   2.71   2.72   2.72   2.73   2.73   2.73   2.73   2.73

 15     2.60   2.66   2.68   2.69   2.70   2.70   2.70   2.71   2.71   2.71
 16     2.58   2.64   2.66   2.67   2.68   2.68   2.68   2.68   2.68   2.69
 17     2.57   2.63   2.64   2.65   2.66   2.66   2.66   2.66   2.67   2.67
 18     2.55   2.61   2.63   2.64   2.64   2.64   2.65   2.65   2.65   2.65
 19     2.54   2.60   2.61   2.62   2.63   2.63   2.63   2.63   2.63   2.63

 20     2.53   2.58   2.60   2.61   2.61   2.62   2.62   2.62   2.62   2.62
 22     2.51   2.56   2.58   2.59   2.59   2.59   2.60   2.60   2.60   2.60
 24     2.49   2.55   2.56   2.57   2.57   2.57   2.58   2.58   2.58   2.58
 26     2.48   2.53   2.55   2.55   2.56   2.56   2.56   2.56   2.56   2.56
 28     2.47   2.52   2.53   2.54   2.54   2.55   2.55   2.55   2.55   2.55

 30     2.46   2.51   2.52   2.53   2.53   2.54   2.54   2.54   2.54   2.54
 35     2.44   2.49   2.50   2.51   2.51   2.51   2.51   2.52   2.52   2.52
 40     2.42   2.47   2.48   2.49   2.49   2.50   2.50   2.50   2.50   2.50
 60     2.39   2.43   2.45   2.45   2.46   2.46   2.46   2.46   2.46   2.46
120     2.36   2.40   2.41   2.42   2.42   2.42   2.42   2.42   2.42   2.43
  ∞     2.326  2.366  2.377  2.382  2.385  2.386  2.387  2.388  2.389  2.389


Source: Reproduced from D. A. Williams, A test for differences between treatment means when 
several dose levels are compared with a zero dose control, Biometrics 27 (1971), with the permission of 
the author and the editor. 